Courtesy of C/C++ Users Journal (March 2004)
In previous columns, I introduced recls, a platform-independent library that provides recursive filesystem searching. In the process, I demonstrated techniques for integrating C/C++ libraries with C++, STL, and C# by implementing mappings to those languages. (The source code for all the versions of the libraries and the mappings is available at http://www.cuj.com/code/ and http://recls.org/downloads.html.) This month I focus on mappings to the D and Java programming languages. D is a C-like language from Walter Bright (http://www.digitalmars.com/d/). Java is the silver bullet that will put an end to C++, fulfill the promise of "write once, run anywhere," and end all our cares for the computational environment. Well, that was the theory. We're still waiting.
I've skipped the COM mappings this month for two reasons. First, I'm still undecided on how to handle Unicode. Second, it was important to include the D mapping this month.
There's really only one significant improvement to the API in this version, which is that the UNIX port has finally happened, so recls now supports UNIX and Win32.
One feature I previously promised was support for Unicode (and ANSI) character encoding. Time has not been my friend, and that's had to wait. Nonetheless, I have been able to centralize all references to character type to the new recls_char_t type, introduce the types recls_char_a_t and recls_char_w_t, and make all character handling use traits. recls_char_t is currently defined as recls_char_a_t (char), but it'll be a small step to allow selective definition as recls_char_w_t when required. The reason I'm hesitating is that I'll be doing the COM mappings next time, and I don't want to get into all the unmanageable ATL gunk that is Release MinDependency, Release MinSize, Unicode Release MinDependency, and so on. I only want one binary version, and for it to run optimally on both Windows 9x and NT.
The other promised feature was that some of the badly named API functions would go. Once again, this will have to wait.
The UNIX implementation uses the UNIXSTL glob_sequence class. This works slightly differently to the WinSTL basic_findfile_sequence<>, because of the different behavior of the Win32 FindFirstFile() and the UNIX glob() APIs. Since FindFirstFile() only ever enumerates the contents of a single directory, and only returns the filename plus extensions of the returned entries, it is a simple matter to "root" the entries returned by winstl::basic_findfile_sequence<> by recording the directory in which the search is conducted, such that the value type is able to give a full path. However, because of the power of glob(), it is possible to issue a search pattern such as "/usr/*/*h", which could result in the enumeration of entries from multiple directories. I chose, therefore, to not attempt any rooting of unixstl::glob_sequence and it simply returns the resultant entry.
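To make the multiple-directory point concrete, here is a minimal sketch that calls the POSIX glob() API directly. The glob_entries() helper is hypothetical (it is not part of recls or UNIXSTL); it simply returns each matched entry exactly as glob() reports it, with no rooting applied.

```cpp
#include <glob.h>      // POSIX glob(), globfree()
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, for illustration only: returns the entries matched
// by a glob() pattern, exactly as glob() reports them (no rooting).
std::vector<std::string> glob_entries(const std::string& pattern)
{
    std::vector<std::string> results;
    glob_t                   gl = {};   // zero-initialised so globfree() is always safe

    // glob() returns 0 on success; a non-zero result (for example
    // GLOB_NOMATCH) is treated here as "no entries".
    if (0 == ::glob(pattern.c_str(), 0, nullptr, &gl))
    {
        for (std::size_t i = 0; i != gl.gl_pathc; ++i)
        {
            results.push_back(gl.gl_pathv[i]);
        }
    }

    ::globfree(&gl);

    return results;
}
```

A call such as glob_entries("/usr/*/*h") may return entries from several different directories in one result set, so there is no single directory that could be recorded to root them all.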
The ramifications of this difference for recls at first seemed daunting, and I had cause to wonder whether my copy-and-change approach to adapting the Win32 implementation files to UNIX was wise. However, it turned out that the resultant differences are slight, and if you're a keen user of diff, you'll see that the two main implementation files recls_api_unix.cpp and recls_api_win32.cpp have ended up being similar indeed. This was assisted by a couple of other abstractions, such as the definition of simple functions such as is_dots() and file_exists(). Hence, I may be able to coalesce the two main implementation files into one in the future. The only caution is that another operating system, such as Macintosh or VMS, would require a too-different implementation, so I'll probably leave it as is for a while.
As you recall, the recls library is entirely reentrant, so there are virtually no threading concerns. The only area in which threading is even an issue is in the copying of entry handles, which used reference counting based on Win32 atomic integer operations. For UNIX, I've taken a conservative approach to the implementation, and based it on PTHREADS (see Programming with POSIX Threads, David Butenhof, Addison-Wesley, 1997), which is the most widely available threading library for UNIX. Hence, the reference counting is made threadsafe by scoping the lock/unlock of a mutex (an instance of the unixstl::thread_mutex class). However, since performance is always something you should be aware of, I've also allowed for the use of Linux kernel atomic operations, should you wish to do so. By defining the symbol RECLS_UNIX_USE_ATOMIC_OPERATIONS, the <asm/atomic.h> file is included, and the reference counting is implemented in terms of the highly efficient atomic_* functions.
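The mutex-scoped reference counting can be sketched roughly as follows. This is not the recls source: scoped_lock and ref_counter are hypothetical names, and a raw pthread_mutex_t stands in for unixstl::thread_mutex, so treat it as an illustration of the technique only.

```cpp
#include <pthread.h>
#include <cassert>

// Hypothetical scoped lock: acquires the mutex on construction and
// releases it on destruction, so every exit path unlocks.
class scoped_lock
{
public:
    explicit scoped_lock(pthread_mutex_t& mx) : m_mx(mx) { pthread_mutex_lock(&m_mx); }
    ~scoped_lock() { pthread_mutex_unlock(&m_mx); }

    scoped_lock(const scoped_lock&) = delete;
    scoped_lock& operator=(const scoped_lock&) = delete;

private:
    pthread_mutex_t& m_mx;
};

// Hypothetical reference counter whose increments and decrements are
// serialised by the mutex rather than by atomic integer operations.
class ref_counter
{
public:
    ref_counter() : m_count(1) { pthread_mutex_init(&m_mx, nullptr); }
    ~ref_counter() { pthread_mutex_destroy(&m_mx); }

    ref_counter(const ref_counter&) = delete;
    ref_counter& operator=(const ref_counter&) = delete;

    // Returns the count after the increment.
    long add_ref()
    {
        scoped_lock lock(m_mx);
        return ++m_count;
    }

    // Returns the count after the decrement; 0 means the caller should
    // destroy the shared resource.
    long release()
    {
        scoped_lock lock(m_mx);
        assert(m_count > 0);
        return --m_count;
    }

private:
    pthread_mutex_t m_mx;
    long            m_count;
};
```

The trade-off is the one described above: a mutex lock and unlock around each count change is portable across PTHREADS implementations, but costs more than a single atomic instruction.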
The port to UNIX went surprisingly well. Since I'm most productive in Visual Studio 98, I tend to prefer to do my coding and initial testing on Win32. However, since the APIs of Win32 and UNIX differ substantially, this can sometimes be a challenge. I use two libraries to help me in this regard. The first is a simple UNIX emulation library (http://synesis.com.au/software/index.html#unixem) for the readdir(), glob(), and gettimeofday() APIs that I wrote to support some of the development of UNIXSTL. It's not production quality, but is certainly good enough for testing.
The other, far better, library is the pthreads-win32 library (http://sources.redhat.com/pthreads-win32/). This is an almost complete implementation of PTHREADS for Win32, and is superb. Not only does it save the rest of us from this gargantuan task, but it is also the simplest thing imaginable to install, build, and use. You unzip to your directory of choice, type "nmake clean VC" (or whatever your preferred compiler), and it just builds without a problem. Simply include that directory in your include paths, and use the DLL and import library generated. If only all free software (my own included) was so straightforward!
Anyway, the upshot of all this Win32-based testing was that when I moved over to the Linux box and ran the build, it just worked the first time: no errors, no warnings. Just a nice lib and executable, and it ran like a dream.
I've only tested on Linux, so it may be that there could be problems on other UNIX systems, but since it only uses glob(), stat(), and PTHREADS, I'm pretty confident it will be fine. (As before, if anyone wants me to port to another platform, just give me a sandboxed login with a compiler and I'm yours.)
There's recently been a lot of debate about D, since it's stuck its head above the programming language parapet (on comp.lang.c++.moderated), and been shot at from all sides. I think the problem is that D has been touted as "an evolutionary successor" to C and C++. I don't think this marketing is appropriate or valid; I'm pretty sure neither of these languages has finished evolving, and I'll be surprised if anything is going to be up to the task of deposing C++ for a long time.
I prefer to think of D as an easy-to-use, extremely powerful, non-VM alternative to Java and .NET. Importantly, it maintains the ability to link to C libraries, which is a significant advantage over these other languages. It supports full templates (not the half-hearted versions touted for Java and .NET) along with several advanced features such as built-in dynamic arrays, unit testing, versioning, integrated Unicode support, and strong typedefs. It also maintains low-level features, such as pointers and inline assembler, for when you simply have to get down and dirty.
So while I don't see it replacing C++, it does provide an easier-to-use alternative for some requirements. It is certainly a more attractive option for C/C++ programmers than .NET and, especially, Java. Naturally, as with any new language, it is currently suffering from a lack of libraries, but the standard library is growing.
The mapping of recls to D involves three stages. First, the recls API functions are declared. Although D provides link compatibility with C, it does not use the preprocessor, so you need to declare the library functions in D; see Listing 1. Note the version statement that lets you define Win32 and Linux variants of the time and size types, to correspond to those defined in recls_platform_types.h. Also note the difference between the uses of alias (like the C/C++ typedef) and typedef (a strong typedef; see my article "True-typedefs," CUJ, March 2003). You are able to define hrecls_t, recls_info_t, and so on as strong types, which means that they cannot be erroneously interchanged for variables of other types. This is a powerful aid to robustness that can be contrasted with the weak typedefs provided by C# (which I described last time).
The next step is to provide D equivalents to the raw API (Listing 2). In the main, this involves translating pointers to C strings into references to D strings (char[]). But it can also involve the translation of pointer arguments to out or inout reference parameters and the simplification of some of the functions.
There is some equivocation on whether this step is necessary in the D community. As I described in the last column, it is a good idea to insulate the P/Invoke functions from the implementation of the external API classes in C#, and I think the same applies in D. Furthermore, unlike .NET and Java, D supports the definition of free functions, so the functions defined in this step are made public, and thus represent a D-ified equivalent to the raw recls C API. In this way, library users are presented with a choice as to how they wish to use it.
The last step is to implement class wrappers (Listing 3). The Entry class is pretty unremarkable, and similar to the other mappings. It wraps an entry handle and calls the API functions to retrieve particular attributes of the entry. The Search class is another matter, although it is still reasonably straightforward. Basically, it stores the search path, pattern, and flags as member variables, and provides a single method opApply(). This is one of the special methods recognized by D, and allows instances of such classes to be enumerated by the foreach loop construct. The body of a foreach statement is converted by the D compiler into a delegate (a smart function that can include a reference to an enclosing stack frame if the function delegated is nested), which is then passed to the opApply() method. (The details of this mechanism are described in "Collection Enumeration: Loops, Iterators, and Nested Functions," by Matthew Wilson and Walter Bright, DDJ, March 2004.)
The last section of Listing 3 shows how unit tests are implemented in D. The unittest sections, of which there can be any number in a source file, are compiled in and executed (prior to main()) when the -unittest flag is specified during compilation. This unittest block also provides an example of how you use the D recls classes, and hopefully demonstrates just how easy that is.
There's one last thing. You may have noticed the module name std.recls. It's a great testament to the early success of recls that it has been accepted into the D Standard Library, and will form part of the core library from Version 0.77 onwards. It would be nice if we could replicate this success with other languages.
Unlike several of the mapped languages, you cannot simply call free functions from within Java code, since Java does not support free functions. Furthermore, Java objects live in bytecode, so there is no possibility of making direct calls. Hence, for mapping native APIs to Java, you must use the Java Native Interface (JNI).
The way JNI works is conceptually simple. You declare a method in Java as you would an abstract method, that is, without a method body, but use the native keyword rather than abstract. This tells the runtime that the implementation of this method is located in an external library, not in the .class file for the given class. Listing 4 shows the org.recls.Search class. It has four native methods: Close(), Initialise(), hasMoreElements(), and nextElement(). These four methods operate on the search handle, which is represented in the Search class as an int, since an int is the same size as a pointer to a recls_fileinfo_t structure on 32-bit systems. (Of course, this wouldn't hold for 64-bit systems, but if recls was ported to a 64-bit system, the native implementation could employ a simple hash lookup to match 32-bit handles to the actual structures.)
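The hash lookup mentioned for 64-bit systems might look something like the following sketch. The handle_table class is hypothetical and is not part of recls; it ignores thread safety and handle wraparound, both of which a real implementation would need to address.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical table that hands out 32-bit handles (suitable for a Java
// int field) and maps them back to native pointers of any width.
template <typename T>
class handle_table
{
public:
    handle_table() : m_next(1) {}   // 0 is reserved to mean "no handle"

    // Registers a pointer and returns the handle that now refers to it.
    std::int32_t insert(T* p)
    {
        assert(p != nullptr);
        const std::int32_t h = m_next++;
        m_map[h] = p;
        return h;
    }

    // Returns the pointer for a handle, or nullptr if it is unknown.
    T* lookup(std::int32_t h) const
    {
        typename std::unordered_map<std::int32_t, T*>::const_iterator it = m_map.find(h);
        return (it == m_map.end()) ? nullptr : it->second;
    }

    // Forgets a handle; the caller still owns and frees the pointer.
    void erase(std::int32_t h) { m_map.erase(h); }

private:
    std::int32_t                         m_next;
    std::unordered_map<std::int32_t, T*> m_map;
};
```

With such a table, the native side would store the result of insert() in the Java object's int field, and call lookup() at the start of each native method to recover the real structure.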
Once you've compiled your Java classes, you need to run the javah tool to generate a C header for the native methods in the classes, which you then implement and export from a shared library (.so on UNIX; .dll on Win32). The commands used in this case are:
javah -classpath ..\..\mappings\Java\recls_java.jar -o .\JNI\include\recls_Search.h org.recls.Search
You specify the classpath (in this case, the recls jar) along with the desired path of the output file and the class. This produces a file that looks something like:
#include <jni.h>
#undef org_recls_Search_RECLS_F_FILES
#define org_recls_Search_RECLS_F_FILES 1L
...
#undef org_recls_Search_RECLS_F_DETAILS_LATER
#define org_recls_Search_RECLS_F_DETAILS_LATER 524288L
/* Class: org_recls_Search
* Method: Close
* Signature: ()V
*/
JNIEXPORT void JNICALL Java_org_recls_Search_Close
(JNIEnv *, jobject);
/* Class: org_recls_Search
* Method: Initialise
* Signature: (Ljava/lang/String;Ljava/lang/String;I)I
*/
JNIEXPORT jint JNICALL Java_org_recls_Search_Initialise
(JNIEnv *, jobject, jstring, jstring, jint);
...
There's a caveat with respect to the generation of the native headers. In the original makefile I fell victim to the copy-paste monster, and the javah commands actually looked like:
javah -classpath ..\..\mappings\Java\recls_java.jar -o .\JNI\include\recls_Search.h org\recls\Search
Strangely, rather than telling me it did not know about a class "org\recls\Search" in the root package, javah managed to extract the necessary information to create the native headers, but replaced the "\" with its Unicode equivalent _0005c, as per the JNI naming standard, so the generated functions had names such as "Java_org_0005crecls_0005cEntry_getDrive." Since these are valid C identifiers, the library compiled and built without a problem, but the VM would understandably fail to find the method at runtime. Although this same thing had caught me several years ago, it failed to click for quite some time. Naturally, this is a bug in javah, but you need to be aware of it to avoid the same problem.
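To see where a name like that comes from, here is a small sketch of the JNI name-mangling rules for non-overloaded native methods. The functions jni_mangle() and jni_function_name() are hypothetical helpers written for this illustration, and the sketch treats each byte as one character, so it is only meaningful for ASCII input.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: escapes one name according to the JNI rules.
//   '.' or '/'       -> "_"
//   '_'              -> "_1"
//   ';'              -> "_2"
//   '['              -> "_3"
//   other non-alnum  -> "_0xxxx" (four lower-case hex digits)
std::string jni_mangle(const std::string& name)
{
    std::string out;

    for (std::string::size_type i = 0; i != name.size(); ++i)
    {
        const unsigned char ch = static_cast<unsigned char>(name[i]);

        if ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z') || (ch >= '0' && ch <= '9'))
        {
            out += static_cast<char>(ch);
        }
        else if (ch == '.' || ch == '/') { out += '_'; }
        else if (ch == '_')              { out += "_1"; }
        else if (ch == ';')              { out += "_2"; }
        else if (ch == '[')              { out += "_3"; }
        else
        {
            char buf[8];
            std::snprintf(buf, sizeof(buf), "_0%04x", static_cast<unsigned>(ch));
            out += buf;
        }
    }

    return out;
}

// Hypothetical helper: builds the exported function name the VM looks for.
std::string jni_function_name(const std::string& className, const std::string& method)
{
    return "Java_" + jni_mangle(className) + "_" + jni_mangle(method);
}
```

Given "org.recls.Search" and "Close", this yields Java_org_recls_Search_Close, as in the generated header. Given the mistaken "org\recls\Entry" and "getDrive", each backslash (character code 0x5c) is escaped, yielding Java_org_0005crecls_0005cEntry_getDrive, a name the VM never asks for.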
You may be asking how the implementation knows where to pick up the native method implementations? Well, if you look again at Listing 4 you can see that the Search class has a static initializer, within which System.loadLibrary() is called, passing the name of our dynamic library: recls_jni. For UNIX systems it must be called librecls_jni.so, and for Win32 systems recls_jni.dll. Since Entry also uses the native library, it's arguable that I should have added a similar static initializer to the Entry class, but since you cannot create an instance of Entry other than via an instance of Search, I think we're pretty safe. But to add it would be entirely benign, since the VM manages the loading of native libraries, and any necessary reference counting.
Let's look now at the implementation of these methods. Listing 5 shows the implementation of the Search.nextElement() method. Essentially, most JNI method implementations work in the same way: Everything is retrieved or set by using the JNI environment handle (JNIEnv*) passed to each function. To access a field or method of a class, you first retrieve a reference to the class by calling GetObjectClass() on the object reference (jobject) passed to your (nonstatic) method implementation function. Then you look up the identifier for the field or method by calling GetFieldID() or GetMethodID(). Finally, you retrieve, set, or call the field or method. For example, to retrieve the recls search handle (as an int) from an instance of the Search class we need to execute the following:
// (i) Get the object's class reference.
jclass cls = henv->GetObjectClass(obj);
// (ii) Lookup the m_hSearch field, which is an int ("I")
jfieldID fid = henv->GetFieldID(cls, "m_hSearch", "I");
// (iii) Retrieve the integer value of the field
int val = henv->GetIntField(obj, fid);
This kind of thing is very cumbersome, so one always uses helper functions. Thus, to get the hrecls_t from a Search instance we use SearchFromJObject(), as you can see in Listing 5.
When it comes to calling methods, we have a little bit more complexity to handle method descriptors. For example, to create an instance of the Entry class, from within the native implementation of Search.nextElement(), we need to execute the constructor for Entry. This requires the constructor method ID, which is retrieved from the Entry class reference via a call to GetMethodID() passing the special constructor name "<init>". Since constructors, along with other methods in Java, can be overloaded, GetMethodID() also requires a method descriptor: in this case it is "(ILjava/lang/String;[Ljava/lang/String;)V", which means a function that takes an int, a String, and an array of Strings, and returns void. Fairly trips off the tongue, doesn't it?
To be fair, you get used to the notation pretty quickly, but it is still easy to make a mistake. However, it's not that hard to write utility templates that construct specification strings on the fly based on the parameter types. I shall leave that as an exercise for the reader. A full description of method descriptors is available from the Java web site (http://java.sun.com/).
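One possible shape for such utility templates is sketched below. To keep it compilable without <jni.h>, it uses hypothetical stand-in tag types (java_int, java_string, and so on); a real version would presumably specialise on jint, jstring, and friends instead. It also relies on variadic templates, so it assumes a C++11 compiler.

```cpp
#include <string>

// Hypothetical stand-in tags for Java types (assumptions, not JNI types).
struct java_int {};
struct java_long {};
struct java_boolean {};
struct java_string {};
template <typename T> struct java_array {};

// Maps one type to its descriptor. The primary template is left undefined,
// so an unsupported type is a compile-time error.
template <typename T> struct descriptor_of;
template <> struct descriptor_of<void>         { static std::string get() { return "V"; } };
template <> struct descriptor_of<java_int>     { static std::string get() { return "I"; } };
template <> struct descriptor_of<java_long>    { static std::string get() { return "J"; } };
template <> struct descriptor_of<java_boolean> { static std::string get() { return "Z"; } };
template <> struct descriptor_of<java_string>  { static std::string get() { return "Ljava/lang/String;"; } };
template <typename T>
struct descriptor_of<java_array<T> >
{
    static std::string get() { return "[" + descriptor_of<T>::get(); }
};

// Concatenates the descriptors of a parameter list.
template <typename... Args> struct arg_list;
template <> struct arg_list<> { static std::string get() { return std::string(); } };
template <typename First, typename... Rest>
struct arg_list<First, Rest...>
{
    static std::string get() { return descriptor_of<First>::get() + arg_list<Rest...>::get(); }
};

// Builds "(params)return" for a method with the given return and parameter types.
template <typename Ret, typename... Args>
std::string method_descriptor()
{
    return "(" + arg_list<Args...>::get() + ")" + descriptor_of<Ret>::get();
}
```

Under these assumptions, method_descriptor<void, java_int, java_string, java_array<java_string> >() builds the constructor descriptor quoted above, and a typo in a type name fails at compile time rather than at runtime inside GetMethodID().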
The other issues you need to be aware of to do basic JNI are how to create and populate arrays, and how to throw exceptions, which are also in Listing 5. One thing you should remember is that you throw a Java exception in C or C++ by calling a method, but execution of the JNI function continues, and the exception is only thrown in the Java VM once the native method returns. Clearly, once you've thrown the exception, the only thing you should do before returning is local and/or native cleanup, since there's no point continuing to make changes that will affect the Java.
One thing that really bugs me is that the strings that are passed to JNI from the native code must be null-terminated. I'm sure we've all come to appreciate the utility and efficiency that comes from considering strings (and other things) merely as iterator ranges, as a result of our inculcation into STL, praise be upon it. The solution to this inconvenience and, in many cases, inefficiency can be seen in the StringFromRange() helper, which uses auto_buffer (see my article "Efficient Variable Automatic Buffer," CUJ, December 2003) to efficiently handle the allocation of a local buffer within which to write a null-terminated equivalent from which the Java string is generated.
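The idea behind such a helper can be sketched without JNI or auto_buffer. In the following, with_c_string() is a hypothetical function: it copies a character range into a fixed-size stack buffer when the range fits, falls back to the heap otherwise, appends the terminator, and hands the result to a callback. In the real helper the callback's role would be played by the JNI call that creates the Java string; the 256-byte threshold is an arbitrary assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: presents the range [first, last) to fn as a
// null-terminated string, avoiding a heap allocation for short ranges.
template <typename F>
auto with_c_string(const char* first, const char* last, F fn) -> decltype(fn(first))
{
    assert(first <= last);

    const std::size_t len = static_cast<std::size_t>(last - first);

    char              local[256];   // used when the range (plus terminator) fits
    std::vector<char> heap;         // used only for longer ranges
    char*             buf = local;

    if (len + 1 > sizeof(local))
    {
        heap.resize(len + 1);
        buf = heap.data();
    }

    std::memcpy(buf, first, len);
    buf[len] = '\0';

    // The buffer is valid only for the duration of this call.
    return fn(buf);
}
```

A caller holding only an iterator range into a longer path can then obtain a terminated copy of, say, just the directory part, without modifying or duplicating the whole path.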
Speaking of efficiency, I'm sure you're guessing from what you've seen already that JNI is not exactly an efficient mechanism; the translation from native to VM and back is very costly. Indeed, even the Sun textbooks (the best for JNI is The Java Native Interface, by Sheng Liang, Addison-Wesley, 1999) recommend that you make informed choices as to the level of granularity at which you make the VM/native interface. Because of this, I chose a somewhat mixed implementation strategy. I assumed that the full path of a given entry is more likely to be used than not, so instances of Entry are created within the JNI (in the Search.nextElement() method) by passing in the path String. This saves the additional cost of a JNI call when the path is retrieved. The same goes for the directory parts: Since they are optional, I assume that if you've asked for them then you will be using them. But all other attributes of an Entry are retrieved, via JNI calls, on demand. Naturally, this may not be the best balance between the construction and footprint of Entry instances and the costs of retrieving the various properties. Only experimentation would determine that. However, the current version demonstrates both approaches, and if you want to research the matter, I'll be glad to adjust the implementation accordingly.
If you look closely at the definition of Entry (Listing 6), you can see that I've used private methods getCreationTime_(), getModificationTime_(), and so on that return long for the implementation of the methods that return Date instances. The reason is simple, albeit a bit shameful: I find working with JNI such a pain that rather than going through the rigmarole to construct an instance of Date from within JNI, I simply convert the operating-system-specific time returned by the recls API functions to a UNIX time, and return that as a long. The Java handles the conversion from long to Date, using a Date conversion constructor. In a production environment you may not get away with such things.
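For the Win32 case, the conversion might look like the sketch below. It is an assumption about the approach, not the recls source: filetime_to_java_millis() is a hypothetical function that takes a FILETIME value already combined into a single 64-bit count of 100-nanosecond intervals since January 1, 1601, and returns milliseconds since the UNIX epoch, which is the unit that Java's Date(long) constructor expects.

```cpp
#include <cstdint>

// Number of 100-nanosecond intervals between 1601-01-01 (the Win32
// FILETIME epoch) and 1970-01-01 (the UNIX epoch): 11644473600 seconds.
const std::int64_t FILETIME_TO_UNIX_EPOCH_100NS = 116444736000000000LL;

// Hypothetical conversion: FILETIME ticks in, milliseconds since the
// UNIX epoch out.
std::int64_t filetime_to_java_millis(std::uint64_t filetimeTicks)
{
    const std::int64_t ticks = static_cast<std::int64_t>(filetimeTicks);

    // 10,000 ticks of 100ns each make one millisecond.
    return (ticks - FILETIME_TO_UNIX_EPOCH_100NS) / 10000;
}
```

On UNIX the time returned by stat() is already in seconds since the same epoch, so the equivalent step there would just be a multiplication by 1000.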
Debugging JNI components is a lot easier than you might think. From within Visual Studio 98, you specify the name of the executable to be the fully qualified path of java.exe. (Take care to ensure you don't specify javac.exe as I did, or you may also waste several minutes trying to figure out why it's not working.) You will need to specify the name of your test class, for example, recls_test (Listing 7), along with the classpath (either the recls_java.jar or the parent directory of org/recls/). You also need to make sure the native library is available to the runtime, either by placing it in the current directory, or a system directory, or modifying the requisite environment variable: PATH in Win32; LD_LIBRARY_PATH in UNIX. A simple method for use in debugging is to specify a system property, by passing -Djava.library.path=XXXX, where XXXX is the location of the native library.
The last thing I am is a one-language kind of guy (that's why I'm writing this column), and I do enjoy using a variety of languages, but I have to say that, outside certain domains, Java remains for me an unappealing thing: inefficient, overly restrictive, and generally painful to use. Sure, it might be my language of choice if I was writing an enterprise server, but I'll be looking elsewhere (D and .NET) when C/C++ and Perl/Python aren't the appropriate solutions.
Notwithstanding my biased gripes, once you've correctly implemented your JNI components, they are as straightforward to use as any other Java classes, as in Listing 7.
The standard way to implement documentation in Java is to use JavaDoc. The JavaDoc comments are simply C-style comments with an extra *, as in "/** This is JavaDoc style */". Although they're not shown in the listings, the Java classes bear these tags. However, I've not actually built the JavaDoc for two reasons. First, the Doxygen project I'm using for most of recls incorporates the Java classes without an issue. Second, I ran out of time, and given that they're included in the main documentation I have let it slide for the moment.
As far as the documentation for D goes, the D sources also contain Doxygen tags, but again it is the case that time was not on my side. Although I've used Doxygen with D, it must use a custom INPUT_FILTER (my modified version of Burton Radon's dfilter.exe, which comes as part of DIG from http://www.opend.org/), and I've not yet modified it to pass through the source of other languages untouched. Therefore, this version of recls does not contain help for the D mapping.
For the next installment, I plan on achieving (most of) the following: