Saturday, June 23, 2012

ClojureCLR 1.4 released with code gen redone

I just pushed version 1.4.0 of ClojureCLR.

This version matches all fixes and enhancements in the same version of Clojure/JVM/TheMotherShip.  It also sports, at long last, the new non-DLR-based code generation phase for the compiler.  I described this work in an earlier post titled "Code gen redo preview".  The benefits outlined in that post--smaller assemblies, faster startup, some speedup--are now available in this release.

In the preview post, I mentioned attempting to re-implement 'light compilation' in the new compiler.  The previous  (DLR-based) version of the compiler has a special compilation mode used during evaluation (but not during AOT-compilation) for any function that does not need to have its own class defined.   It used DynamicMethods, a lighter weight version of IL generation that can be significantly faster than methods generated to assemblies.

The way the DLR generated DynamicMethods allowed live constants to be embedded in a way that full compilation did not; as a result, there were some constructs that would load when evaluated but would not AOT-compile.  It never happened to me, but I know of one person who got bit by that (and who asked me to leave light compilation out of this version).  That person need not worry -- I have not yet finished coding light compilation and it is not active in this release.

Does it matter?  Most likely not.  You could see some speedup loading very large files, such as clojure/core.clj.  However, I'm pretty sure light compilation was massively speeding up running the Clojure test suite.  For some reason, with the current version, if you start up ClojureCLR under the debugger and run the full test suite, msvsmon.exe, the Visual Studio Remote Debugging Monitor, goes absolutely nuts allocating memory.  It appears to max at about 6GB on my PC with 8GB.  One sees some serious thrashing at that point.  I have no idea why.  I have no problem compiling the entire clj source bootstrap code, which starts off evaluating core.clj and friends first and then going back to compile it, with not a hiccup from msvsmon.

So, at present, I run the test suite from the command line.  If I need to debug on the tests, I just load the test file in question and run the tests for it only.

If you see this kind of memory-hogging behavior in msvsmon.exe, let me know.

The near-term roadmap for ClojureCLR development, in no particular order, is:

  1. Catch up with Clojure 1.5.x-alpha changes, particularly the new work on reducers.
  2. Examine the perf of the test suite and see what's going on there.
  3. Work on a single assembly distribution of ClojureCLR (using nested assembly loading), following the trail blazed by aaronc and Ralph Moritz.
  4. Work on running ClojureCLR on Mono.  (Some very positive reports from Robert Johnson on his experiments).
  5. (Maybe) finish implementing light compile, for my own sense of fulfillment if nothing else.

The work on reducers on the JVM side has introduced version-specific implementation (using Fork/Join in JVM 7 or falling back to other means on earlier versions).  I suspect a similar 3.5/4.0 .Net distinction could be made to take advantage of the Task Parallel Library (or whatever it is called these days).  If you've played with TPL under ClojureCLR, I'd love to hear about it.

Cheers.






Monday, March 26, 2012

Code gen redo preview

Rewriting the code generation phase of a compiler is not for the faint of heart. Nor, perhaps, for the sound of mind.

I've nearly completed a rewrite of the code gen code of the ClojureCLR compiler.  There are still a few things on my punch list (see below), but all the clojure.test-clojure.* tests run now.  I hope an intrepid few will give it a spin before I push the changes to master.  The new code can be found in the nodlr branch in the github repo.

When I wrote the ClojureCLR compiler, I was interested in seeing what kind of mileage I could get out of the Dynamic Language Runtime, specifically the DLR's expression tree mechanism.  The DLR's ETs extended the ETs used in Linq by providing enhanced control flow capabilities.  They are central to other dynamic language implementations on the CLR such as IronPython and IronRuby.

The first version of the ClojureCLR compiler mimicked the JVM compiler through its initial phases.  The Lisp reader translates source code to Lisp data structures that are parsed to generate abstract syntax trees.  The ClojureJVM compiler traverses the ASTs to produce JVM bytecodes.  The ClojureCLR compiler instead generates DLR ETs from the ASTs.   Those ETs are then asked to generate MSIL code.

I got a lot of mileage out of using the DLR for code generation.  I got to avoid some of the hairier aspects of MSIL and the CLR-- things like value types, generic types, nullable types, for example, are handled nicely by ETs.  I also found it easier  to experiment.  However, using ETs  had at least two drawbacks. One was that going from ASTs to MSIL through ETs likely nearly doubles the work of MSIL generation. Another was that ETs were restricted to producing static methods only.  Working around this restriction introduced several inefficiencies in the resulting code.

The Clojure model for functions maps each defined function to a class.  For example, compiling

(defn f
  ([x] ... )
  ([x y] ... ))

yields a class named something like user$f__1295 that implements clojure.lang.IFn, with overrides for virtual methods invoke(Object) and invoke(Object,Object).  (The actual value to which f would be bound would be an instance of this class.)

Note that the invoke overrides of necessity are instance methods.  Recall from above that DLR ETs cannot produce instance methods.  Toss in another little problem with referring to unfinished types.  Shake and stir.  You end up with the following abomination:  Where Clojure/JVM generates one class and two methods for the example above, ClojureCLR would have to generate two classes and four methods.  An invoke override method is just a passthrough to a DLR-generated static method taking the function object as a first parameter.

For several years I hoped that the DLR team would get around to looking into class and instance method generation.  This now seems unlikely.  So I finally decided to rewrite the code generation phase to eliminate most uses of the DLR.

The new code gen code yields significant improvements in compilation time and code size.  Compiling the bootstrap clojure core environment is roughly twice as fast. The generated assemblies are about 20% smaller.  Startup time (non-NGEN'd) is 11% faster.  A few benchmarks I've run show speedups ranging from 4% to 16%.  This is in line with my best hopes.

One other benefit: with code generation more closely modeled after the JVM version, future maintainers will need less knowledge of the DLR.

There are drawbacks to this move.  The DLR guys know a lot more about generating MSIL code than I do.  Some wonderful goodness with names like Expression.Convert and Expression.Call were my best friends.  They are (mostly) gone now.  And, oh, the beauty of DebugView for ETs for debugging code gen--this will be missed.  My new best friends are peverify and .Net Reflector, the caped duo for rooting out bad MSIL.  Wonderful in their own way, but I have a sense of loss.

So, where are we?  I have a little more work to do before putting this on the master branch.    I plan to make one last traversal of the old code looking at all occurrences of my former best friends  to make sure I've been consistent in handling the complexities they hid.  I also plan to reimplement a 'light compile' variation to be used during evaluation.  The current version has it.  (What this is and why it matters I leave to another time.)  Neither task will take long.

In the meantime, the nodlr branch is ready for a workout by those who care and dare. Have at it.


P.S.  The DLR is still being used to provide CLR interop and polymorphic inline caching.  Another topic-for-another-day.


Friday, February 3, 2012

vsClojure: It's alive!

vsClojure is alive (again)!

The ability to do Clojure(CLR) development in a Visual Studio context has been a fairly constant demand.  vsClojure,  a VS extension supporting ClojureCLR projects, had been fulfilling this need. 
However, the project had gone quiet for a while and dormancy does not inspire confidence in the OSS world.

Jon, AKA jmis, the author of vsClojure, and I discussed how to move vsClojure forward.  The upshot is that Jon will continue contributing to vsClojure and we'll transition the maintainer role to me.  What that's meant so far: Jon's been working hard these last few weeks mowing the lawn and pulling some weeds.  I've been applauding his efforts.  Time to invite the neighbors over for a lawn party.  Here's what we're celebrating:

vsClojure has a new home.    The repo has been moved to https://github.com/vsclojure/vsclojure.  The README there has instructions for installing and for building from source.  (Installation is easy:  use the VS extension manager to pull vsClojure from the Visual Studio Gallery.)

vsClojure has a new release. Several outstanding issues were closed.  The big advance is that ClojureCLR 1.3 is now supported.

If you have the old version of vsClojure installed, you will see a notification of vsClojure's new home  if you update the deprecated version.  (I'm not sure if VS will automatically notify you.)


vsClojure is being actively developed.   Jon and I are working on a development plan for enhancements to vsClojure.    Features currently supported include:

  • Clojure project type
  • Building and running clojure projects 
  • Clojure source editor
  • Syntax highlighting
  • Brace matching
  • Auto-indentation
  • Source formatting 
  • Block commenting
  • Hippie completion
  • Integrated REPL 
  • Load all project files into REPL 
  • Load active editor file into REPL
  • Switch to active file's namespace 
  • History 

Let us know what features you'd like to see added or what needs work.  You can create issues on the github repo.  Feel free to start discussions on the discussion group.  And, of course, feel free to dive in and hack away.

My thanks to Jon for all the effort he's put in to vsClojure to date.  I'm even more thankful he's willing to keep going.  I'm looking forward to it.


Monday, January 23, 2012

Compiling and loading in ClojureCLR

Wherein I document environment variables and other factors influencing compiling and loading files in ClojureCLR and how ClojureCLR differs from Clojure in this regard.

Compiler variables

During AOT-compilation, the following vars are consulted to control aspects of the compilation process:

Var                     doc says
*compile-path*          Specifies the directory where 'compile' will write out .class files. This directory must be in the classpath for 'compile' to work. Defaults to "classes"
*unchecked-math*        While bound to true, compilations of +, -, *, inc, dec and the coercions will be done without overflow checks. Default: false.
*warn-on-reflection*    When set to true, the compiler will emit warnings when reflection is needed to resolve Java method calls or field accesses. Defaults to false.

If you compile by invoking the compile function, such as from a REPL, you will have had a chance to set these vars to appropriate values. However, when compiling from the command line by running Clojure.Compile.exe, you do not have a chance to run Clojure code to initialize these vars. Instead, you can set environment variables to initialize these vars prior to compilation.

The same is true for Clojure. In fact, ClojureCLR and Clojure used the same environment variables for these variables until just recently. Starting with the 1.4.0-alpha5 release (already in the master branch), ClojureCLR has changed the environment variable names to be strictly POSIX-compliant. This is due to problems with periods in environment variable names in Cygwin's bash -- see this thread for more information. Here are the names:

Clojure & older ClojureCLR            new in ClojureCLR
clojure.compile.path                  CLOJURE_COMPILE_PATH
clojure.compile.unchecked-math        CLOJURE_COMPILE_UNCHECKED_MATH
clojure.compile.warn-on-reflection    CLOJURE_COMPILE_WARN_ON_REFLECTION

BTW, ClojureCLR defaults *compile-path* to ".".  "classes" didn't seem to make sense given that ClojureCLR creates assemblies.
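
The renaming follows a simple pattern: periods and hyphens become underscores and everything is upper-cased.  Here is a minimal sketch of that pattern.  The function posix-env-name is a hypothetical helper written for illustration; it is not part of ClojureCLR.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper (not part of ClojureCLR): derive the POSIX-style
;; environment variable name from the older dotted name.
(defn posix-env-name [dotted-name]
  (-> dotted-name
      (str/replace #"[.-]" "_")   ; periods and hyphens become underscores
      str/upper-case))

(posix-env-name "clojure.compile.warn-on-reflection")
;; => "CLOJURE_COMPILE_WARN_ON_REFLECTION"
```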

Locating files

For identifying libraries for loading, Clojure relates the symbol naming the library to a Java package name and uses Java's mapping of package name to a classpath-relative path. For example, evaluating (compile 'a.b.c) causes Clojure to look for a file a/b/c.clj relative to some root listed on the classpath.  The result of the compilation will be a set of classfiles, written to classes/a/b/c.

ClojureCLR follows Clojure in mapping dotted symbol names to relative paths.  Not having classpaths, ClojureCLR instead uses the value of the environment variable CLOJURE_LOAD_PATH to supply roots for the file probes. In addition, it will look (first) in the current directory and the directory of the entry assembly.

The same holds for load, use, require and other lib-loading functions.
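
To make the mapping concrete, here is a rough sketch of the probing described above.  Both function names are hypothetical, the roots are passed in rather than read from CLOJURE_LOAD_PATH, and the sketch only handles the dots-to-slashes part of the mapping; the real lookup lives inside the runtime and does more.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helpers for illustration only.
(defn lib->relative-path
  "Maps a lib symbol such as 'a.b.c to the relative source path a/b/c.clj."
  [lib]
  (str (str/replace (name lib) "." "/") ".clj"))

(defn probe-candidates
  "Returns the source paths to try, one per root, in the order given."
  [roots lib]
  (map #(str % "/" (lib->relative-path lib)) roots))

;; Assumed roots: current directory first, then two load-path entries.
(probe-candidates ["." "/opt/app" "/opt/libs"] 'a.b.c)
;; => ("./a/b/c.clj" "/opt/app/a/b/c.clj" "/opt/libs/a/b/c.clj")
```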

Assembly output

The Clojure compiler outputs (many) class files.  The ClojureCLR compiler outputs (not as many) assemblies.  All classes resulting from (compile 'a.b.c) will go into an assembly named a.b.c.clj.dll located in *compile-path*.

When evaluating (load "a/b/c"),  ClojureCLR will look for both <AppDomain.CurrentDomain.BaseDirectory>\a.b.c.clj.dll and <any_load_path_root>\a\b\c.clj, and load the assembly if it exists and has a timestamp newer than the .clj file (if it exists).  At the moment the same set of roots (as named above) is used for assemblies and source code.  

AppDomain.CurrentDomain.BaseDirectory is used as the root for ClojureCLR assembly probes as that is also the CLR's root for resolving assembly references.  
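
The choice between assembly and source can be sketched as follows.  This is only an illustration of the rule as stated above, not the actual loader code: the function names are hypothetical, timestamps are plain numbers, and I assume a tie goes to the source file since the assembly must be strictly newer.

```clojure
;; Hypothetical helpers illustrating the rule in the text.
(defn assembly-file-name
  "Name of the assembly produced by compiling the given lib."
  [lib]
  (str (name lib) ".clj.dll"))

(defn load-target
  "Given last-write timestamps (numbers, or nil when the file is missing),
  returns :assembly, :source, or nil when neither file exists."
  [assembly-time source-time]
  (cond
    (and assembly-time source-time)
    (if (> assembly-time source-time) :assembly :source)

    assembly-time :assembly
    source-time   :source
    :else         nil))

(assembly-file-name 'a.b.c)   ;; => "a.b.c.clj.dll"
(load-target 200 100)         ;; => :assembly (assembly is newer)
(load-target 100 200)         ;; => :source   (source was edited later)
```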

Too many assemblies

Each file loaded during compilation will go into its own assembly.  I find this terribly inelegant.  The distribution for ClojureCLR itself needs Clojure.Main.exe, Clojure.Compile.exe, and the DLR support assemblies, of course, but also thirty-plus assemblies resulting from compiling the Clojure source that defines the initial environment. The pprint lib alone contributes eight assemblies.  They are not really independent.   Conceivably that code all could go into one assembly.  

I've not been able to think of a way to make this work.  I know that the eight files making up pprint are related.  They get compiled because the main pprint file loads each of them, and loading a file while compiling causes that file to be compiled also.  I could very easily write the compiler to output the code into the same assembly as the parent.  However, pprint could load support code that should not be part of its assembly, that should have its own assembly.  In fact, it does;  pprint loads clojure.walk.  It happens to do this with a :use clause in its ns form, but it doesn't have to.  Without a mechanism in Clojure that allows us to distinguish these uses of load, I'm afraid we're stuck with some inelegance.

Tuesday, January 17, 2012

Porting effort for Clojure contrib libs

Looking for Clojure contrib lib projects to port to ClojureCLR?

I looked at the most popular libs on https://github.com/clojure, the official libs of the clojure project.  I defined popularity by the number of watchers, lacking a better criterion.  Here are the top projects, sorted by number of watchers when I looked recently and ignoring those in single digits and all java.* projects:

Watchers  Project
129       core.logic
69        core.match
60        tools.nrepl
37        tools.cli
36        data.finger-tree
35        tools.logging
32        core.unify
28        data.json
23        test.generative
20        core.cache
19        core.memoize
18        algo.monads
15        data.xml
11        test.benchmark
10        core.incubator
10        data.csv
10        tools.macro

There are some fairly trivial edits that are required in porting most libs.  These include:

  1. Substituting an appropriate CLR exception class.  For example, IllegalArgumentException becomes ArgumentException.  If a throw uses Exception, that will work as is.
  2. Substituting interop method names.  For example, toString becomes ToString, hashCode becomes GetHashCode, etc.  Most String methods and some I/O methods just need capitalization.  BTW, ClojureCLR preserves case on most clojure.lang class method names so they don't need to be changed.  (You're welcome.)  Also, method names on protocols won't need to be changed.

I'll refer to these kinds of changes below as the usual.
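
As a rough illustration of how mechanical the usual is, here is a sketch of a first-guess renamer for method names: capitalize the first letter, unless the name is a known exception to that rule.  The function and the table are hypothetical aids, not a tool I ship, and the table holds only the hashCode case mentioned above.

```clojure
(require '[clojure.string :as str])

;; Renames that plain capitalization would get wrong.
;; Only the hashCode example from the text is listed; extend as needed.
(def special-renames
  {"hashCode" "GetHashCode"})

(defn suggest-clr-name
  "First-guess CLR name for a JVM method name (assumed non-empty).
  The result is a suggestion to check by hand, not a guarantee."
  [jvm-name]
  (or (special-renames jvm-name)
      (str (str/upper-case (subs jvm-name 0 1)) (subs jvm-name 1))))

(suggest-clr-name "toString")   ;; => "ToString"
(suggest-clr-name "hashCode")   ;; => "GetHashCode"
```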

I did a quick scan of the source of each project to estimate the effort required to port the project to ClojureCLR. In the order given above, here are some comments on each.

core.logic: This is one of the larger projects.  The usual, and not that much of it.  The only thing I saw that might take a little more investigation is that the deftype Pair implements java.util.Map$Entry.  (See below for more.)  Easy. (Unless it requires actual thought, in which case you'd have to understand the code, and that would make it a Challenge.)

core.match:  Another large project.  The usual, and not much of it.  The bean-match function will require adaptation to CLR classes and the regular expression matcher will need to be examined -- JVM vs CLR regexes always require a look.  Of most concern is the deftype MapPattern that mentions java.util.Map.  The question is always dealing with IDictionary and IDictionary<K,V> -- support for arbitrary generics is always tricky.  Probably Easy, with the same caveat as core.logic.

tools.nrepl: This is likely to be tricky.  There are some Java classes that will have to be ported.  Of greater concern is the amount of low-level I/O on sockets.  At best, a Medium project, likely a Challenge.  Given that this project is being redesigned, it might be wise to wait for 2.0 and then put in the effort.

tools.cli: The usual, and not much of it.  There is a test that uses an Integer method.  Trivial.

data.finger-tree:  The usual.  The only concern is the mention of java.util.Set.  There is no System.Collections.ISet, only System.Collections.Generic.ISet<T>, so some thought will be required. At worst, Medium; more likely Easy.

tools.logging: This will take some work because adapters for .Net logging tools will have to be developed.  One might consider log4net, ELMAH, NLog.  The good news is that the code is designed to plug different adapters into its framework, so developing new adapters should be easy, requiring mostly a decent knowledge of the target logging framework.  Most of the tests will have to be rewritten.  Medium, probably fun.

core.unify: The usual.  The same concern about java.util.Map mentioned for core.match.  I'm guessing this is trivial here.  Easy.

data.json: We know exactly how much work this will take.  See Porting libs to ClojureCLR: an example.

test.generative: Needs tools.namespace.  That didn't make the popularity cut, but it should be barely Medium to port, mostly due to the need to think a little about the I/O interop.  In test.generative, there are some library calls, to Random, Math.* methods, system time, etc., that will take a little more work than just the usual.  Barely Medium.

core.cache: A moment's thought about replacing java.lang.Iterable in the definition of defcache.  Otherwise, just the usual.  Easy.

core.memoize: Needs core.cache.  Might work as-is!  Trivial.

algo.monads: Might work as-is!  Trivial.  Hey, when was the last time you saw 'trivial' and 'monads' in such proximity?

data.xml:  The README notes that it is not yet ready for use.  Really, this should be called java.xml because of its dependence on org.xml.sax, javax.xml.parsers, etc.  This will require a major rewrite.  Until this is complete, I can't say how hard it will be.

test.benchmark: Looks straightforward.  Easy.

core.incubator: The toughest thing is reference to java.util.Map (see above).  Trivial.

data.csv: The I/O will take some time, but at worst a Medium.  A very Easy Medium at that.

tools.macro: Appears to be Trivial.

So, what are you waiting for?  Plenty of easy ones to get started with and a few more challenging ones.  Whatever you pick, you'll have a chance to read some good Clojure code, always a worthwhile exercise.

Where are the hard ones, you ask?  They certainly exist, just not among the official contrib libs.  There are plenty of other Clojure projects floating around that will require significant effort.

Port a lib today!

A note on java.util.Map$Entry:  clojure.lang.IMapEntry extends java.util.Map$Entry on the JVM. ClojureCLR could not do that because the equivalent to Map$Entry, System.Collections.DictionaryEntry, is a struct and can't be subclassed. Also, we have the problem with the generic System.Collections.Generic.KeyValuePair<TKey,TValue>. I shudder when I see Map$Entry; this is a sign that real thinking will be required.

Friday, January 6, 2012

Porting libs to ClojureCLR: an example

Responses to the 2011 ClojureCLR survey gave high priority to the porting of Clojure contrib libs to ClojureCLR.  One action item resulting from an analysis of the responses was to provide examples of the porting process.  As an example, I decided to port the data.json contrib lib, authored by Stuart Sierra.  In looking across all the authorized contrib libs on github.com/clojure, this port was above trivial but below daunting in complexity.  (In a later post, I will analyze all the contrib libs for complexity.)

The following recipe will work for simple ports.

Figure out how the original code works.  This step might be optional on the simplest ports involving just simple interop method renaming.  For this port, I had to modify one algorithm and think through extending a protocol to CLR types.

Create a project for the port.  I did not fork the original data.json project on github; the project needs its own identity and you're not going to be issuing pull requests back to the original.

You are on your own for picking project names and namespace structures for these ports.  I chose the name cljclr.data.json for both the project and the namespace.  You will find my project here.

I would appreciate input on naming these projects.  Here are my current thoughts.  I didn't want to call this project clr.data.json--I plan to use clr.X.X for ports of contrib libs that have names like java.X.X.  The namespace for data.json is actually clojure.data.json, so I decided to use cljclr.data.json.  I thought of data.json.clr and other variations adding clr as a component, but was offended by having an extra layer of subdirectory injected.  I don't actually like cljclr--I can't read it and I always mistype it.  Help!

In the absence of tools such as leiningen or a working Visual Studio extension, for now you will have to come up with your own project structure.  I just used

  • <root>
    •  src
    •  test

as the two main branches, with subdirectories corresponding to the namespaces.

Copy files from the source project.  Usually, you can preserve the basic code structure, relocating files appropriately for your namespace structure.  The data.json project really has only two files of significance:  json.clj and json_test.clj.  I copied them into src/cljclr/data/json.clj and test/cljclr/data/json_test.clj.

Modify file headers. I changed the ns directive in each file to match my namespace structure. If you are interested in things like copyrights, you will have to figure out how you want to acknowledge the original author according to the license on the original project.

Scan for trouble.  I look for trouble.  Here are some of the things I look for.

  • Imports in the ns directive. Imports of clojure.lang classes are usually okay--I've been careful to maintain class names and public method names to match Clojure as best I can.  Anything starting java. will have to be replaced.
  • Interop calls. Scan for (.name,  (name.,  and (CapName/name.  If you are lucky, a simple capitalization will handle many of the (.name calls.
  • Type hints.  Most type hints of classes outside clojure.lang will need to be changed.  Carats feel like sticks.
  • Exceptions.  The only exception class that works unchanged is Exception itself.
  • CapitalLetters. Given that Clojure code is mostly lowercase, anything with CapitalLetters is likely to be trouble.
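
The interop scan in particular is easy to rough out with a regular expression.  The sketch below is an assumption-laden helper, not something from the port itself: it only looks for the three shapes named in the list, (.name, (name. and (CapName/name, and it will miss anything fancier.

```clojure
;; One alternative per shape, each immediately after an open paren:
;;   \.[A-Za-z]\w*                     instance member call, e.g. (.charAt
;;   [A-Za-z][\w.]*\.(?=[\s)])         constructor call,     e.g. (StringBuilder.
;;   [A-Z]\w*/\w+                      static member call,   e.g. (Character/isWhitespace
(def interop-pattern
  #"\((?:\.[A-Za-z]\w*|[A-Za-z][\w.]*\.(?=[\s)])|[A-Z]\w*/\w+)")

(defn find-interop
  "Returns the interop-looking call heads found in the given source text."
  [source-text]
  (re-seq interop-pattern source-text))

(find-interop
 "(defn f [s] (Character/isWhitespace (.charAt s 0)) (StringBuilder. 16) (str/trim s))")
;; => ("(Character/isWhitespace" "(.charAt" "(StringBuilder.")
```

Ordinary Clojure calls such as (defn and (str/trim are skipped because the static-call alternative requires a capitalized class name.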

Do simple renamings.  Hack out the bad imports and add new ones as you work through the code.  Work through each interop call and determine equivalents; ditto for type hints and exceptions.

I adopt a uniform method of marking the places where I make changes.  This makes it easier to update the port as the original lib changes.  You will see lines like this in my code:

 (Char/IsWhiteSpace c) (recur (.Read stream) result)             ;DM: Character/isWhitespace .read   

where the comment shows what the changed code was in the original.
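
One payoff of uniform markers is that they are easy to pull back out when the original lib changes.  Here is a small sketch, assuming the ;DM: convention shown above; marked-changes is a hypothetical helper and not part of the port.

```clojure
(require '[clojure.string :as str])

(defn marked-changes
  "Returns the original-code notes from every line carrying a ;DM: marker."
  [source-text]
  (for [line (str/split-lines source-text)
        ;; Capture the text after the marker, minus trailing whitespace.
        :let [[_ original] (re-find #";DM:\s*(.*\S)" line)]
        :when original]
    original))

(marked-changes
 (str " (Char/IsWhiteSpace c) (recur (.Read stream) result)   ;DM: Character/isWhitespace .read   \n"
      " (.Append sb \\\")\n"))
;; => ("Character/isWhitespace .read")
```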

For json.clj, I made the following simple changes.

Method renamings:

From                       To                    Count
(.read                     (.Read                30
(.append                   (.Append              14
(.print                    (.Write               11
(.unread                   (.Unread              3
(Character/isWhitespace    (Char/IsWhiteSpace    3
(.isArray                  (.IsArray             2
(.charAt                   (.get_Chars           1
(.toString                 (.ToString            1

Type renamings:

From              To                      Count
PushbackReader    PushbackTextReader      9
PrintWriter       TextWriter              5
PrintWriter       StreamWriter            1
CharSequence      String                  1
EOFException      EndOfStreamException    5

PrintWriter was mostly used as a type hint, and could have TextWriter substituted.  In one place, it was new'd.  TextWriter is abstract and can't be constructed, so StreamWriter had to be used in that place.

This project is obviously quite I/O-centric.  Fortunately, there is sufficient commonality between the I/O libraries on the JVM and CLR to make this part of the translation fairly routine.

At this point, you are probably very close to being able to load the file. You can try loading or compiling and let the errors guide you.

Check for reflection warnings.  I find it useful to  (set! *warn-on-reflection* true) at the start of each code file.  I don't do this for performance, but it can catch bad type hints and misnamed methods in the code.   The perf I care about is coding perf, not runtime.

Do the hard parts. The remaining changes were not as easy.  There were two other type renamings, not I/O-related.
java.util.Map => System.Collections.IDictionary
java.util.Collection => System.Collections.ICollection
If you see Java collection types, you are going to have to think about the correct alternatives.  In general, we have no way to deal with CLR generic collection types -- most uses of these type names will require the type parameters to be instantiated, so you can't say you want all System.Collections.Generic.ICollection<T> to be handled. However, most of the generic collection types will provide non-generic interfaces to work with.  That is the solution I chose for this code:

(defn- pprint-json-dispatch [x escape-unicode]
 (cond (nil? x) (print "null")
       (instance? System.Collections.IDictionary x) 
                        (pprint-json-object x escape-unicode)        ;DM: java.util.Map
       (instance? System.Collections.ICollection x) 
                        (pprint-json-array x escape-unicode)         ;DM: java.util.Collection
       (instance? clojure.lang.ISeq x) 
                        (pprint-json-array x escape-unicode)
       :else (pprint-json-generic x escape-unicode)))

Another common trouble spot is dealing with the Java primitive wrapper classes, such as Integer.  You most likely won't see many type hints, but you will see calls to static methods.  There is only one in this code.  In read-json-hex-character, I translated (Integer/parseInt s 16) to (Int32/Parse s NumberStyles/HexNumber).  (I added an import for System.Globalization.NumberStyles.)  I don't have a list of rules for this.  You're on your own.

Extending protocols to Java library classes will also cause you to spend some time thinking.  The primary difficulty here came in extensions to the Write-JSON protocol.  Extensions to things like nil and clojure.lang.Named were unchanged.  Other extensions had to be modified.

You will have a problem anytime you run into java.lang.Number.  This is a base class for all the primitive numeric wrapper classes such as Integer.  There is no equivalent in the CLR. I created extensions for each primitive numeric type.

(extend System.Byte Write-JSON {:write-json write-json-plain})
(extend System.SByte Write-JSON {:write-json write-json-plain})
(extend System.Int16 Write-JSON {:write-json write-json-plain})
...

Also, java.math.BigInteger and java.math.BigDecimal will need to be translated to clojure.lang.BigInteger and clojure.lang.BigDecimal, and CharSequence usually goes to String.

The only significant algorithmic change was in write-json-string.  Stuart was careful to properly handle full Unicode, which means not just iterating through the string character by character but dealing with actual Unicode code points as expressed in the UTF-16 encoding used in Java strings.  CLR strings use the same encoding, but the proper way to iterate through Unicode code points is different.  This took the most time for me to translate due to my own ignorance.  I ended up with

(defn- write-json-string [^String s ^PrintWriter out escape-unicode?]
  (let [sb (StringBuilder. ^Integer(count s))]
    (.append sb \")
    (dotimes [i (count s)]
      (let [cp (Character/codePointAt s i)]
        (cond
         ;; Handle printable JSON escapes before ASCII
         (= cp 34) (.append sb "\\\"")
...
         (< 31 cp 127) (.append sb (.charAt s i))
...
         :else (if escape-unicode?
   ;; Hexadecimal-escaped
   (.append sb (format "\\u%04x" cp))
   (.appendCodePoint sb cp)))))
 ...

becoming

(defn- write-json-string [^String s ^TextWriter out escape-unicode?]
  (let [sb (StringBuilder. ^Int32 (count s))
        chars32 (StringInfo/ParseCombiningCharacters s)]
    (.Append sb \")
    (dotimes [i (count chars32)]
      (let [cp (Char/ConvertToUtf32 s (aget chars32 i))]
        (cond
         ;; Handle printable JSON escapes before ASCII
         (= cp 34) (.Append sb "\\\"")
...
         (< 31 cp 127) (.Append sb (.get_Chars s i))
...
         :else (if escape-unicode?
   ;; Hexadecimal-escaped
   (.Append sb (format "\\u%04x" cp))
   (.Append sb  (Char/ConvertFromUtf32 cp))))))
...

Translate your tests.  This is generally easier.  The only changes I had to make were to some setup code for certain tests.  Only a handful of changes were required.

Test. Repair. Repeat.  Good luck.

After just a few iterations, I had all tests passing except one.  In the pretty-print test, I was getting output for some floating point numbers that were exact integers without a .0 at the end as was required by the test.  Having run into this before, I recalled that I had defined a helper function named fp-str to take care of this.  I added

(defn- write-json-float [x ^TextWriter out escape-unicode?] 
  (.Write out (fp-str x)))                                  

and then modified the two float-type extends:

(extend System.Double Write-JSON {:write-json write-json-float})
(extend System.Single Write-JSON {:write-json write-json-float})

GREEN!!!!

Celebrate.  
Publish.

I leave these steps to you.


Next steps? Choose something to port and go for it.  


Many of the contrib libs will require significantly less work than this.  There are others that will require almost complete rewrites.  Outside of the core contrib libs, many of the requested ports such as leiningen and the web frameworks are going to be a lot of work.

Resources

* data.json:  https://github.com/clojure/data.json
* clrclj.data.json -- https://github.com/dmiller/clrclj.data.json

Tuesday, January 3, 2012

Referring to types

The basics for referring to types for CLR interop are the same for ClojureCLR as for Clojure on the JVM. I will assume you are familiar with interop as covered in http://clojure.org/java_interop or your favorite Clojure intro.

Standard Clojure allows use of the symbols int, double, float, etc. in type hints to refer to the corresponding primitive types. ClojureCLR allows this and extends it to the numeric types present in the CLR but not in the JVM: uint, ulong, etc. Similarly, the shorthand array references such as ints and doubles work, and are joined by uints, ulongs, etc.
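For instance, a hint using one of the shorthand array references looks like this.  This is a small sketch using only ints, which exists on both platforms; the function name sum-ints is made up for the example.

```clojure
;; ^ints tells the compiler that xs is an array of 32-bit ints,
;; so aget can be compiled without reflection.
(defn sum-ints
  [^ints xs]
  (areduce xs i acc 0 (+ acc (aget xs i))))

(sum-ints (int-array [1 2 3]))  ;=> 6

;; On ClojureCLR the same pattern should apply to the CLR-only
;; shorthands mentioned above, e.g. a parameter hinted as ^uints.
```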

The CLR is not C#.
Do not let the presence of int and company for type hinting put you in a C# frame of mind. When specifying generic types, you cannot use C# notation:

System.Collections.Generic.IList<int>

Instead, you must use the actual CLR type name:

System.Collections.Generic.IList`1[System.Int32]

Remember that float becomes System.Single; I've had to dope-slap myself on that one a few times.
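If you find yourself translating from C# notation a lot, the naming rule is mechanical enough to write down: the base name, a backquote, the number of type arguments, then the arguments in square brackets.  The sketch below only builds the name as a string.  The names csharp->clr and clr-generic-name are invented for this illustration, and the alias table covers just a few of the C# keywords.

```clojure
(require '[clojure.string :as string])

;; A few C# keyword aliases and the CLR type names they stand for.
(def csharp->clr
  {"int"    "System.Int32"
   "long"   "System.Int64"
   "float"  "System.Single"
   "double" "System.Double"
   "string" "System.String"
   "bool"   "System.Boolean"})

(defn clr-generic-name
  "Builds the CLR name of an instantiated generic type.  Type arguments
   that are C# keywords are translated; anything else is passed through."
  [base & type-args]
  (str base "`" (count type-args)
       "[" (string/join "," (map #(get csharp->clr % %) type-args)) "]"))

(clr-generic-name "System.Collections.Generic.IList" "int")
;=> "System.Collections.Generic.IList`1[System.Int32]"

(clr-generic-name "System.Collections.Generic.Dictionary" "string" "float")
;=> "System.Collections.Generic.Dictionary`2[System.String,System.Single]"
```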

Clojure uses symbols to refer to types. This works on the JVM because package-qualified class names are lexically compatible with symbols. Not so here. The backquote and square brackets in the type name shown above cannot be part of a symbol name. If you type that string of characters into the REPL, you will get

user=> System.Collections.Generic.IList`1[System.Int32]
CompilerException System.InvalidOperationException:
  Unable to resolve symbol: System.Collections.Generic.IList in this context
   at ...
1
[System.Int32]

The input string is parsed as separate entities:

System.Collections.Generic.IList
`1
[System.Int32]

Not what was intended.

In addition to backquotes and square brackets, a fully-qualified type name can contain an assembly identifier--that involves spaces and commas. In fact, CLR typenames can contain arbitrary characters. Backslashes can escape characters that do have special meaning in the typename syntax (comma, plus, ampersand, asterisk, left and right square bracket, left and right angle bracket, backslash).

To allow symbols to contain arbitrary characters, ClojureCLR extends the reader syntax using "vertical bar quoting". Vertical bars are used in pairs to surround the name or a part of the name of a symbol. Any characters between the vertical bars are taken to be part of the symbol name. For example,

|A(B)|
A|(|B|)|
A|(B)|

all mean the symbol whose name consists of the four characters A, (, B, and ).  I consider only the first one to be readable; quoting the entire name is to be preferred.  To quote the IList example above, you would write

|System.Collections.Generic.IList`1[System.Int32]|

To include a vertical bar in a symbol name that is |-quoted, use a doubled vertical bar.

|This has a vertical bar in the name ... || ...<>@#$@#$#$|
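The quoting rules so far can be modeled in a few lines of ordinary Clojure.  This is an illustrative model only, not the reader's actual implementation; the function name bar-unquote is made up, and it does not check for an unterminated quote.  Given the text of a token, it returns the symbol name that the rules above would produce.

```clojure
(defn bar-unquote
  "Model of |-quoting: a bar toggles quoting on or off and is dropped,
   except that a doubled bar inside a quoted region is a literal bar."
  [token]
  (loop [cs (seq token), quoted? false, out []]
    (if-let [c (first cs)]
      (cond
        ;; ordinary character: keep it
        (not= c \|)
        (recur (next cs) quoted? (conj out c))

        ;; doubled bar while quoted: emit one bar, stay quoted
        (and quoted? (= \| (second cs)))
        (recur (nnext cs) true (conj out \|))

        ;; single bar: start or end a quoted region
        :else
        (recur (next cs) (not quoted?) out))
      (apply str out))))

(bar-unquote "|A(B)|")    ;=> "A(B)"
(bar-unquote "A|(|B|)|")  ;=> "A(B)"
(bar-unquote "A|(B)|")    ;=> "A(B)"
(bar-unquote "|a||b|")    ;=> "a|b"
```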

With this mechanism we can make a symbol for a fully-qualified typename, such as:

(|com.myco.mytype+nested, MyAssembly, Version=1.3.0.0, Culture=neutral, PublicKeyToken=b14a123334343434|/DoSomething x y)

or

(reify 
 |AnInterface`2[System.Int32,System.String]| 
 (m1 [x] ...)
 I2
 (m2 [x] ...))


There are a number of things you should note about |-quoting and about generic type references.

First, what |-quoting does is to prevent characters from stopping the token scan. Checks on symbol validity that follow token scanning are still in effect. These include not starting with a digit, containing a non-initial colon, and a few others. When scanning A(B), the left parenthesis stops scanning the token that begins with A.  When scanning |A(B)|, the left and right parentheses do not stop the scan.  Scanning |ab:|, however, is the same as scanning ab:, a colon being a perfectly fine token constituent.  The colon at the end is still a no-no, and so the token is rejected and the reader throws an exception.

(I could have taken a more radical approach and allowed |ab:|.   One then gets into all kinds of edge cases that I didn't want to solve.  I feel that a more radical quoting approach requires consultation and agreement with the Clojure powers-that-be.)


Second, be careful with namespaces for symbols. Any / appearing |-quoted does not count as a namespace/name separator. If you have special characters in both the namespace name and the symbol name, you must |-quote each one separately. Thus,

(namespace 'ab|cd/ef|gh)    ;=> nil
(name 'ab|cd/ef|gh)         ;=> "abcd/efgh"
 
(namespace 'ab/cd|ef/gh|ij) ;=> "ab"
(name 'ab/cd|ef/gh|ij)      ;=> "cdef/ghij"

Rather than

ab/cd|ef/gh|ij

it would be more readable to write

ab/|cdef/ghij|
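The namespace rule can be modeled the same way.  The sketch below is a model for illustration, not the reader's code.  The name split-quoted-symbol is hypothetical, and it assumes the first unquoted / is the separator.  It returns the namespace and name parts of a token.

```clojure
(defn split-quoted-symbol
  "Returns [ns-part name-part] for a token.  A / inside a |-quoted region
   is an ordinary character; the first / outside one is the separator."
  [token]
  (loop [cs (seq token), quoted? false, ns-part nil, cur []]
    (if-let [c (first cs)]
      (cond
        ;; doubled bar while quoted: literal bar
        (and (= c \|) quoted? (= \| (second cs)))
        (recur (nnext cs) true ns-part (conj cur \|))

        ;; single bar: toggle quoting
        (= c \|)
        (recur (next cs) (not quoted?) ns-part cur)

        ;; first unquoted slash: what we have so far is the namespace
        (and (= c \/) (not quoted?) (nil? ns-part))
        (recur (next cs) false (apply str cur) [])

        :else
        (recur (next cs) quoted? ns-part (conj cur c)))
      [ns-part (apply str cur)])))

(split-quoted-symbol "ab|cd/ef|gh")     ;=> [nil "abcd/efgh"]
(split-quoted-symbol "ab/cd|ef/gh|ij")  ;=> ["ab" "cdef/ghij"]
(split-quoted-symbol "ab/|cdef/ghij|")  ;=> ["ab" "cdef/ghij"]
```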

Third, you will usually need to fully namespace-qualify generic types and their parameters.  For example,

|System.Collections.Generic.IList`1[System.Int32]|

works as a type reference while

|System.Collections.Generic.IList`1[Int32]|

does not.

Also, aliasing via import is not of much help.  After eval'ing

(import '|System.Collections.Generic.IList`1|)

the symbol |IList`1| will refer in the current namespace to the generic type |System.Collections.Generic.IList`1|, but that is of no help in referring to instantiated IList types. You cannot then refer to

|IList`1[System.Int32]|

Perhaps someday we will introduce a more compositional approach to generic types and symbols that will accommodate this.

Fourth, if you are familiar with |-quoting in Common Lisp, the ClojureCLR mechanism is not as inclusive.   In CL you could include a literal vertical bar in a symbol name with backslash-escaping: abc\|123 has name “abc|123”. CL has  \-escaping for characters in symbol tokens; ClojureCLR does not.


Finally, note that when printing with *print-dup* true, symbols with 'bad' characters will be |-quoted.