Why do languages like Java distinguish between string and char while others do not? [closed]

I've noticed that languages like Java have a char primitive and a string class. Other languages like Python and Ruby just have a string class. Those languages instead use a string of length 1 to represent a character.

I was wondering whether that distinction was because of historical reasons. I understand the language that directly influenced Java has a char type, but no strings. Strings are instead formed using char* or char[].

But I wasn't sure if there was an actual purpose for doing it that way. I'm also curious if one way has an advantage over another in certain situations.

Why do languages like Java distinguish between the char primitive and the string class, while languages like Ruby and Python do not?

Surely there must be some sort of design concern about it, be it convention, efficiency, clarity, ease of implementation, etc. Did the language designer really just pick a character representation out of a hat, so to speak?

asked Feb 21 '13 by Eva

1 Answer

EDIT 1 Added a number of links to sources; improved the historical story on Lisp; answered why Java has primitives. EDIT 2 Added a comment on modern scripting languages explaining how efficiency is no longer such a concern.

Back in the old days, memory was expensive - even simple computers had only a few kilobytes. The typical terms of service you have to agree to would exceed the RAM of the whole system. This meant that data structures had to be very much smaller than those you can design today.

Computers started in Britain and the US in the 1940s, and the minimum character set those engineers needed was the Western European alphabet without any exciting accents. 0-9, A-Z and a-z is 62 characters. Add 33 control characters, the space and 32 punctuation marks and symbols and it all fits into 7 bits - 128 code points. Perfect for a teletype.

Character encodings also differed between architectures: if you used IBM, you had to know EBCDIC, which was completely different from ASCII.

Languages of the '60s and '70s reflected these concerns and packed strings into the smallest possible space:

  • Pascal: A packed array of bytes - fixed length and not null-terminated
  • C: Null-terminated sequence of bytes (often thought of as an array using the insanely hackerish idea that an array subscript is simply pointer arithmetic)
  • Fortran 66: Strings? You don't need them. Store a pair of characters in an integer and use READ, WRITE and FORMAT

As a programmer of these languages, I can say this sucked. Especially as most business programs required a lot of text entry and manipulation. As memory became cheaper, programmers tended to write string utilities before anything else to be able to do anything productive.

Fixed-length strings (e.g. Pascal) were efficient, but awkward if you needed to extend or contract them by even a single character.

C's null-terminated approach has the disadvantage that the length is not stored with the string, so it is trivially easy to overwrite the buffer and crash the application. Such bugs are still a leading cause of computer insecurity. There are two ways of dealing with this (see the sketch after this list):

  • Check the string length on every write: this simply means scanning memory until you find the null character. Ugly
  • malloc new memory and copy the string into the new memory, then free the old buffer
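
To make the trade-off concrete, here is a minimal Java sketch (the byte-buffer layout and the helper name are purely illustrative, not any real C API): a terminator-style string has to be scanned to discover its length, while a string object that stores its length answers immediately.

    public class TerminatedVsCounted {
        // C-style: the only way to learn the length is to scan for the 0 terminator.
        static int lengthOfTerminated(byte[] buffer) {
            int i = 0;
            while (buffer[i] != 0) {   // walks straight off the end if the terminator is missing
                i++;
            }
            return i;
        }

        public static void main(String[] args) {
            byte[] cStyle = {'h', 'i', 0, 0, 0};            // data plus terminator in a fixed buffer
            System.out.println(lengthOfTerminated(cStyle)); // 2, found by scanning every byte

            String counted = "hi";                          // Java keeps the length with the data
            System.out.println(counted.length());           // 2, answered without scanning
        }
    }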

Through the '80s, standard libraries were increasingly brought in to handle strings - these were provided by the tools vendors and the OS providers. There were major moves to standardize, but the parties fought each other tooth and nail to control the standards, and it was ugly.

Increasing internationalization also brought another problem: international character sets. First, ASCII was expanded to 8 bits in the ISO 8859 family of encodings to cover other European languages (accented Latin letters, Greek, Cyrillic), and then Unicode brought computers to all corners of the world. And that brought the issues of character encodings such as UTF-8 and UTF-16, and of converting between these different representations.
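
As a small illustration of those encoding issues using the standard java.nio.charset API (the example string is arbitrary): the same text yields different byte sequences under UTF-8 and UTF-16, and converting between them means decoding with the matching charset and re-encoding.

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class Encodings {
        public static void main(String[] args) {
            String text = "h\u00e9llo";                              // "héllo": one accented letter

            byte[] utf8  = text.getBytes(StandardCharsets.UTF_8);    // é encodes as 2 bytes
            byte[] utf16 = text.getBytes(StandardCharsets.UTF_16);   // 2 bytes per char plus a BOM

            System.out.println(utf8.length);                         // 6
            System.out.println(utf16.length);                        // 12

            // Converting: decode with the charset the bytes were written in, re-encode with another.
            String decoded = new String(utf16, StandardCharsets.UTF_16);
            byte[] converted = decoded.getBytes(StandardCharsets.UTF_8);
            System.out.println(Arrays.equals(converted, utf8));      // true
        }
    }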

I should also note that Lisp introduced garbage collection. This solves C's complexities with malloc/free. Lisp's incredibly powerful array and sequence libraries work naturally on strings.

The first major, popular language to bring these trends together was Java. It combined three improvements in the language:

  1. Internationalization and Unicode: A distinct datatype, Character, and the primitive char
  2. Encapsulation: The issues of fixed-length vs. null-terminated strings were obviated by:
    1. Strings being immutable
    2. Clever optimizations in the VM and GC
  3. Libraries: All basic string manipulation features were standardized in the language (see the sketch below).
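
A short sketch of what that buys, using nothing beyond the standard java.lang.String API: the length travels with the object, "modifying" a string produces a new one rather than touching a shared buffer, and the everyday manipulation methods are already there.

    public class StringEncapsulation {
        public static void main(String[] args) {
            String s = "Hello, world";

            System.out.println(s.length());                   // 12: no terminator scan, no buffer to overrun

            String shouted = s.toUpperCase();                 // returns a new String; s is unchanged
            System.out.println(s);                            // Hello, world
            System.out.println(shouted);                      // HELLO, WORLD

            // Standardized manipulation - no hand-rolled utility library required.
            System.out.println(s.substring(7));               // world
            System.out.println(s.replace("world", "Java"));   // Hello, Java
            System.out.println(String.join("-", "a", "b", "c")); // a-b-c
        }
    }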

Nowadays there are languages where every value is an object. However, when Java was conceived in the early '90s, GC and JIT/HotSpot technologies were nowhere near as fast as they are now (at least partially because of RAM limitations, but algorithms have improved too). Gosling was concerned about performance and kept primitive datatypes.
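
A tiny sketch of the distinction (standard autoboxing behaviour, nothing version-specific): char is a bare 16-bit value, while Character wraps that value in a heap object, with the allocation and indirection that implies.

    public class PrimitiveVsBoxed {
        public static void main(String[] args) {
            char c = 'A';               // 16-bit primitive: no object header, no allocation
            Character boxed = c;        // autoboxing wraps the value in a heap object

            int next = c + 1;           // primitives take part in arithmetic directly
            System.out.println((char) next);         // B
            System.out.println(boxed.charValue());   // A: unwrap to get the primitive back

            // A char[] is a flat block of 16-bit values; a Character[] is an array of object references.
            char[] flat = {'J', 'a', 'v', 'a'};
            Character[] refs = {'J', 'a', 'v', 'a'};
            System.out.println(flat.length + " " + refs.length); // 4 4
        }
    }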

One other point: in Java it is natural that there's a Character class - it is the obvious home for a number of operations and utility methods such as isWhitespace() and isLetter(), the latter being somewhat complicated by Japanese, Korean and the Indian languages.
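
For illustration, a few calls to the real java.lang.Character utilities (the sample characters are arbitrary); note that isLetter() answers correctly for non-Latin scripts, which is where the complication comes from.

    public class CharacterUtilities {
        public static void main(String[] args) {
            System.out.println(Character.isWhitespace(' '));    // true
            System.out.println(Character.isWhitespace('\t'));   // true
            System.out.println(Character.isWhitespace('x'));    // false

            System.out.println(Character.isLetter('q'));        // true
            System.out.println(Character.isLetter('7'));        // false
            System.out.println(Character.isLetter('\u3042'));   // true: Hiragana 'a'
            System.out.println(Character.isLetter('\u0915'));   // true: Devanagari 'ka'
        }
    }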

Python made a poor early decision to define its strings as sequences of 8-bit ASCII characters; the consequent problems showed up first in the introduction of a second datatype (unicode) that is subtly different and incompatible, and are only now being resolved by the complicated migration to Python 3.x.

Modern languages (including scripting languages) follow the broad consensus on how a string library should look, as exemplified by Java and Python.

Each language is designed for a specific purpose and therefore balances competing design concerns in different ways. Modern languages have the benefit of the enormous improvements in performance and memory of the last 60 years, so they can favor generalization, purity and utility over CPU and RAM efficiency. This is especially true of scripting languages, which by their nature have already made that decision. Modern languages therefore tend to have only the high-level string type.

TL/DR Early computers were frighteningly limited in memory, forcing the simplest possible implementations. Modern languages benefit from GC, recognize internationalized (8-bit -> 16-bit) characters, and encapsulate string datatypes to make string manipulation safe and easy.

answered Sep 28 '22 by Andrew Alcock