|
YOUR FEEDBACK
Did you read today's front page stories & breaking news?
SYS-CON.TV |
TODAY'S TOP SOA & WEBSERVICES LINKS Feature Taking the Leap - C++ containers vs C# Collections
Taking the Leap - C++ containers vs C# Collections
By: Allan Woloshin
May. 1, 2002 12:00 AM
While moving from C++ to C# means giving up template-based containers, that doesn't mean you can't effectively organize your data. And like C++, C# collections have some unique benefits. The concept of computerized arrays has been around almost as long as computers themselves. It allows a program to deal with large quantities of data almost as simply as dealing with a single unit of data. It underlies almost all sorting algorithms. C++, like most other languages, has built-in language support for arrays. In C++, arrays are always one-dimensional - but you can allocate arrays of arrays to counter that fact. The name of an array is almost always converted into a pointer to its first element, and most array operations work equally well on pointers. For nonrectangular arrays, C++ works equally well with arrays of pointers - allocating and freeing the odd-shaped array can be inconvenient, but otherwise it works extremely well. Why would anybody want anything more? There are several problems with arrays. For one thing, you cannot pass an array to a subroutine; instead, C++ automatically converts the array name into a pointer to the first element. This means that arrays are always passed by reference rather than value - just the opposite of most C++ arguments. If the subroutine changes the content of the array, the caller sees the changes, like it or not. This also means that the subroutine doesn't automatically know the size of the array! Many security holes in C and C++ programs can be attributed to functions such as gets(), which accepts the address of a character buffer but not the maximum length. Converting an array address to a pointer solves many problems, but it causes some, too. C++ has special rules governing the use of pointers, and not all of them are designed to work correctly with pointers. For instance, you can pass an array of Derived to a function that expects an array of Base and the compiler won't complain - but your program won't work correctly. Arrays are also inflexible. Your program must know the maximum size in advance or else it must be prepared to re-allocate the array (and copy all elements) whenever the array is full. The latter usually means keeping track of two different sizes - the amount of space reserved so far, and the amount of space actually used. In order to "insert" an element into the array, you must be prepared to shift every existing element after that location. Any time you expand an array or insert elements, you also change the address of elements - if another part of the program is a pointer or reference to a moved element, that pointer or reference is now "dangling" and a potential source of major program errors. Finally, arrays are pure memory devices - unless your operating system has a "virtual memory" capability, every element takes up memory all of the time. Most of these problems can be solved. You can pass a length parameter along with your array pointer; you can use special naming conventions that make it easy to spot when you've passed the wrong type of array; and so on. The thing is, you have to know what you need. In other words, it's error-prone. An alternative to the built-in array is the linked list. C and C++ make it easier to build linked lists than many other languages, and this can solve the problem of having to extend the array and of inserting new elements without moving the others. Also, linked lists usually allow mixed types; putting a Derived object in a linked list of Base probably isn't an error. But the list still needs to fit into memory; the size is still unknown unless it's separately maintained; and it's still not possible to pass the linked list by value. Furthermore, linked lists are very error-prone, especially on multi-threaded environments or in the presence of exceptions, and linked lists cannot be processed in arbitrary order. In the late 1980s and throughout the 1990s, Alexander Stepanov and David Musser, and later Meng Lee, developed a library of template classes and algorithms. They had several goals, including algorithms that were fast, powerful, and extensible. The library had many different components, but among the most useful were their container classes. In 1997 they offered their work to the ISO/IEC C++ standards committee, which adopted it, with changes.
C++ Containers
The C++ standard says, "A sequence is a kind of container that organizes a finite set of objects, all of the same type, into a strictly linear arrangement." The three types of sequences are vector, list, and deque.
When you iterate through all the elements of a sequence, you get them "in order" - the first element comes first. The C++ standard says associative containers "provide an ability for fast retrieval of data based on keys." Practically speaking, this means that the index doesn't have to take an integer. (If it does take an integer, the values don't have to be contiguous; this is known as a "sparse array.") The four types of associative containers are map, multimap, set, and multiset.
The associative containers can only contain elements that can be sorted - you can either make sure that operator less-than works correctly, or else you can construct a class object that knows how to sort the contained objects. When you iterate through all the elements of an associative container, you get them in that sorted order. The library also defines container adapters such as queue, which accepts a sequence such as list or deque, and ensures that access is always at the beginning or end; stack, which does much the same thing, except that elements are popped off in reverse order from when they were pushed on; and priority_queue, which returns the lowest element first. These standardized containers solved a lot of problems. With such a rich assortment to choose from, there is bound to be one that satisfies performance requirements for almost any need. Standard C++ library routines understand how to access data in these containers, and there is a rich variety of algorithms included in the library - algorithms for copying, sorting, even I/O. Unfortunately, there were several problems that standardized C++ containers simply couldn't solve. For instance, the standard containers require all elements to be in memory at once. (However, they provide a framework you can use to develop your own container without this requirement - and your container will work with all Standard C++ algorithms.) Mixed data types are still a major issue. Also, it's still a bad idea to have a reference or pointer to a contained element, except under certain very well-defined circumstances. To make it possible to iterate through all of the elements in a container, we need some other mechanism. These are called iterators.
C++ Containers Cannot Contain Mixed Types!
The same thing happens with arrays: void bar() {If you must store hierarchical items in an array or container, you do so by creating a container (or array) of pointers to person, as we've shown in Listing 2. The last part of the code sample illustrates a problem with this technique. You must delete all of the elements whenever the vector is going to be deleted, or else you get memory leaks. This is both tedious and error-prone.
How C++ Iterators Work
The C++ standard defines five different types of iterators:
All C++ containers have methods (member functions) that return iterators that point into the container. The type of iterator returned depends on the container type.
Most algorithms take not one, but two iterators to define a "range" of data. One iterator marks the beginning of the data and another marks the position past the end. This arrangement seems strange at first, but it turns out to be rather convenient. If the iterators are bidirectional or random, you can determine the number of included elements by subtracting the two. There is no need to use special flags to define an empty range; simply pass the same iterator value for beginning and past-the-end. Also, it makes testing for the end of a range simple; with some iterator types, you cannot use less-than or greater-than to tell if the range is finished, but a special sentinel value can be used to mark the end, and when input or output is finished, this value is returned. When manipulating a container, it's important to consider its effect on existing iterators. For instance, if you add an element to a vector and the internal capacity has to increase, iterators, except for end(), are not necessarily valid any more. This is especially inconvenient when you try to examine and manipulate a container at the same time; for instance, if you write a program that erases entries from a deque of customers, the act of deleting a customer might make you "lose your place." Many operations return new iterators; for instance, erase() might return an iterator to the element after the one that was deleted. Other operations are more problematic. There is a lot more to C++ iterators; we've only scratched the surface in this article. Consult a good C++ textbook to find out more.
List and Explain C# Collections
C# arrays can be made multidimensional in the same way that C++ arrays are: by creating an array of arrays. int[][] jagged = new int[3][];They can also be multidimensional in the classic sense, although of course this must be rectangular. // Allocate 20 elements.However, C# arrays don't support dynamic resizing or inserting in the middle, any more than C++ arrays do. If you want to do this type of operation, you need to use one of the collection classes. The list of standard C# collection types is even more rich than the list of standard C++ container classes.
C# Collections Are Weakly Typed
This eliminates an entire classification of hard-to-explain, hard-to-find errors. But this freedom doesn't come for free. C# collections return the contained object as a pointer to an object; that means most of your custom methods won't work correctly until you fix the pointer type. CollectionOfFoo[j].bar(j); // Won't compileAt the root of this minor inconvenience is a more devilish problem. Even though you know that your collection holds only Foo objects, the compiler doesn't know this. Nothing prevents you from inserting an integer into a collection of Foo objects. Every time you access one of the collected objects, the program silently renegotiates the object type. This means that using C# collections will be at least slightly slower than equivalent C++ containers, although the difference will probably be quite minor.
Comparison of C++ Container Classes with C# Collections
C# does support the equivalent of a C++ forward iterator, but this is rarely used. Instead, elements in C# collections are generally accessed in one of two ways: through the indexer or through a foreach statement. To use the indexer, simply use familiar array notation to access an element. Like C++, different C# collections can use different data types for the index. For instance, a HashTable might let you look up a customer by Customer Number. If you wish to process an entire collection or search through the elements in order, you should use the foreach statement.
How foreach Works
Consider a simple for loop and array access in C++: for(int j = 0; j < mylist.Count(); j++)The programmer that writes this has to pay careful attention to a variety of minor issues.
With the introduction of the foreach loop in C#, problems like these are outdated. foreach iterates through each element in a collection. A simple foreach loop may look like: int[] numbers = {1,2,3,4};You must have already noticed the ease of use of foreach loop and how it took care of lot of potential problem areas in the for loop. With the foreach loop, there is no special setup required for the loop, no special exit condition, no need for a function to get the size of the collection, and no need to fetch individual items in the collection. Deciding how many times to loop is the responsibility of collection, relieving the programmer of this burden. A foreach loop allows iteration for any class implementing the IEnumerable interface. This includes all C# collections and collection classes included in namespace System.Collec- tions. You can also write your own collection class and make it work with foreach by following IEnumerable guidelines. foreach will automatically place the fetched element into a temporary variable. Note that the temporary variable j cannot be modified, i.e., j is read-only. Attempts to modify j will produce a compile error. foreach(int j in numbers) {How to Write Your Own C# Collection Our first advice is: Don't do it if you don't need to! The collections included with C# are quite comprehensive, and handle a wide variety of different uses. Not every conceivable use, of course. If you have some strange memory constraints, or some need to use more than one key at the same time, you might have good reason to write your own collection. The easiest way to create your own collection is to implement the indexer property. A simple indexer property might look like this: public object this[int index] {The index does not have to be an int; it can be any data type you wish to support. You use this value to access the correct element. A get access occurs when your indexer is used anywhere in an expression except on the left-hand side of an assignment. Foo f = myCollection[3]; C# automatically calls your property get code with index set to 3. The set access occurs, naturally enough, when the index expression is on the left-hand-side of an assignment statement. myCollection[3] = f; In Listing 3, we implement an indexer and access objects using it. What happens when client code tries to use a foreach loop instead of using an indexer? class MyClass {This code will fail to compile, sending a message that it cannot find GetEnumerator(), which is a method exposed by the IEnumerable interface. For a class to be usable in a foreach loop, it must support the IEnumerable interface, which is defined in System.Collections. Perhaps surprisingly, IEnumerable has only one method; IEnumerator GetEnumerator(). GetEnumerator() returns an instance of interface IEnumerator, which is also defined in the System.Collections namespace. The IEnumerator allows unidirectional iteration (read-only access) of items in a collection. IEnumerator publishes two methods: bool MoveNext() and void Reset(); and one property, object Current {get;}. MoveNext(), like the name suggests, moves the enumerator to the next element in the collection. It returns true if the operation was successful, or false: if there aren't any elements left to enumerate. Reset() sets the enumerator to the initial position, which prepares the Enumerator for iteration over the collection. Note: Initial position is not the first element in the collection; rather, it's a virtual position before the first element in the collection. Current {get;} gets the current element in the collection. If the last call to MoveNext() returned false, or if MoveNext() hasn't even been called after Reset(), then Current {get;} throws an IndexOutOfRangeException. To make our class support all of the same types of access that other C# collections do, all we need to do is derive our class from IEnumerable, define a support class from IEnumerator, and then implement GetEnumerator() by returning an instance of our new class. (Note: It's possible to have our class derive from both IEnumerator and IEnumerable, and GetEnumerator() would just return this. We don't recommend this, because it isn't thread-safe.) Listing 4 shows all the changes required to implement an enumerator for use with the foreach construct. This class can be used the same way as any other C# collection class.
Conclusion
SOA WORLD LATEST STORIES
SUBSCRIBE TO THE WORLD'S MOST POWERFUL NEWSLETTERS SUBSCRIBE TO OUR RSS FEEDS & GET YOUR SYS-CON NEWS LIVE!
|
SYS-CON FEATURED WHITEPAPERS MOST READ THIS WEEK |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||