7. Strings¶

7.1. A compound data type¶

So far we have seen five types: `int`, `float`, `bool`, `NoneType` and `str`. Strings are qualitatively different from the other four because they are made up of smaller pieces — characters.

Types that comprise smaller pieces are called compound data types. Depending on what we are doing, we may want to treat a compound data type as a single thing, or we may want to access its parts. This ambiguity is useful.

The bracket operator selects a single character from a string:

```>>> fruit = "banana"
>>> letter = fruit[1]
>>> print(letter)
```

The expression `fruit[1]` selects character number 1 from `fruit`. The variable `letter` refers to the result. When we display `letter`, we get a surprise:

```a
```

The first letter of `"banana"` is not `a`, unless you are a computer scientist. For perverse reasons, computer scientists always start counting from zero. The 0th letter ( zero-eth ) of `"banana"` is `b`. The 1th letter ( one-eth ) is `a`, and the 2th ( two-eth ) letter is `n`.

If you want the zero-eth letter of a string, you just put 0, or any expression with the value 0, in the brackets:

```>>> letter = fruit[0]
>>> print(letter)
b
```

The expression in brackets is called an index. An index specifies a member of an ordered set, in this case the set of characters in the string. The index indicates which one you want, hence the name. It can be any integer expression.

7.2. Length¶

The `len` function returns the number of characters in a string:

```>>> fruit = "banana"
>>> len(fruit)
6
```

To get the last letter of a string, you might be tempted to try something like this:

```length = len(fruit)
last = fruit[length]       # ERROR!
```

That won’t work. It causes the runtime error `IndexError: string index out of range`. The reason is that there is no 6th letter in `"banana"`. Since we started counting at zero, the six letters are numbered 0 to 5. To get the last character, we have to subtract 1 from `length`:

```length = len(fruit)
last = fruit[length-1]
```

Alternatively, we can use negative indices, which count backward from the end of the string. The expression `fruit[-1]` yields the last letter, `fruit[-2]` yields the second to last, and so on.

7.3. Traversal and the `for` loop¶

A lot of computations involve processing a string one character at a time. Often they start at the beginning, select each character in turn, do something to it, and continue until the end. This pattern of processing is called a traversal. One way to encode a traversal is with a `for` loop:

```for index in range(len(fruit)):
letter = fruit[index]
print(letter)
```

This loop traverses the string and displays each letter on a line by itself. The `range()` operator takes the length of the string, and thus returns the sequence starting from 0 and ending at `len(fruit)-1`. The loop will then beginning with `index` 0 in the string `fruit`, assigning `letter` the first character. The last character accessed is the one with the index `len(fruit)-1`, which is the last character in the string.

Using an index to traverse a set of values is so common that Python provides an alternative, simpler syntax — the `for` loop:

```for char in fruit:
print(char)
```

Each time through the loop, the next character in the string is assigned to the variable `char`. The loop continues until no characters are left.

The following example shows how to use concatenation and a `for` loop to generate an abecedarian series. Abecedarian refers to a series or list in which the elements appear in alphabetical order. For example, in Robert McCloskey’s book Make Way for Ducklings, the names of the ducklings are Jack, Kack, Lack, Mack, Nack, Ouack, Pack, and Quack. This loop outputs these names in order:

```prefixes = "JKLMNOPQ"
suffix = "ack"

for letter in prefixes:
print(letter + suffix)
```

The output of this program is:

```Jack
Kack
Lack
Mack
Nack
Oack
Pack
Qack
```

Of course, that’s not quite right because Ouack and Quack are misspelled. You’ll fix this as an exercise below.

7.4. String slices¶

A substring of a string is called a slice. Selecting a slice is similar to selecting a character:

```>>> s = "Peter, Paul, and Mary"
>>> print(s[0:5])
Peter
>>> print(s[7:11])
Paul
>>> print(s[17:21])
Mary
```

The operator `[n:m]` returns the part of the string from the n-eth character to the m-eth character, including the first but excluding the last. This behavior is counterintuitive; it makes more sense if you imagine the indices pointing between the characters, as in the following diagram:

If you omit the first index (before the colon), the slice starts at the beginning of the string. If you omit the second index, the slice goes to the end of the string. Thus:

```>>> fruit = "banana"
>>> fruit[:3]
'ban'
>>> fruit[3:]
'ana'
```

What do you think `s[:]` means?

7.5. String comparison¶

The comparison operators work on strings. To see if two strings are equal:

```if word == "banana":
print("Yes, we have no bananas!")
```

Other comparison operations are useful for putting words in lexigraphical order:

```if word < "banana":
print("Your word, " + word + ", comes before banana.")
elif word > "banana":
print("Your word, " + word + ", comes after banana.")
else:
print("Yes, we have no bananas!")
```

This is similar to the alphabetical order you would use with a dictionary, except that all the uppercase letters come before all the lowercase letters. As a result:

```Your word, Zebra, comes before banana.
```

A common way to address this problem is to convert strings to a standard format, such as all lowercase, before performing the comparison. A more difficult problem is making the program realize that zebras are not fruit.

7.6. Strings are immutable¶

It is tempting to use the `[]` operator on the left side of an assignment, with the intention of changing a character in a string. For example:

```greeting = "Hello, world!"
greeting[0] = 'J'            # ERROR!
print(greeting)
```

Instead of producing the output `Jello, world!`, this code produces the runtime error `TypeError: 'str' object doesn't support item assignment`.

Strings are immutable, which means you can’t change an existing string. The best you can do is create a new string that is a variation on the original:

```greeting = "Hello, world!"
new_greeting = 'J' + greeting[1:]
print(new_greeting)
```

The solution here is to concatenate a new first letter onto a slice of `greeting`. This operation has no effect on the original string.

7.7. The `in` operator¶

The `in` operator tests if one string is a substring of another:

```>>> 'p' in 'apple'
True
>>> 'i' in 'apple'
False
>>> 'ap' in 'apple'
True
>>> 'pa' in 'apple'
False
```

Note that a string is a substring of itself:

```>>> 'a' in 'a'
True
>>> 'apple' in 'apple'
True
```

Combining the `in` operator with string concatenation using `+`, we can write a function that removes all the numbers from a string:

```def remove_numbers(s):
numbers = "0123456789"
s_without_numbers = ""
for letter in s:
if letter not in numbers:
s_without_numbers += letter
return s_without_numbers
```

Test this function to confirm that it does what we wanted it to do.

7.8. A `find` function¶

What does the following function do?

```def find(strng, ch):
index = 0
while index < len(strng):
if strng[index] == ch:
return index
index += 1
return -1
```

In a sense, `find` is the opposite of the `[]` operator. Instead of taking an index and extracting the corresponding character, it takes a character and finds the index where that character appears. If the character is not found, the function returns `-1`.

This is the first example we have seen of a `return` statement inside a loop. If `strng[index] == ch`, the function returns immediately, breaking out of the loop prematurely.

If the character doesn’t appear in the string, then the program exits the loop normally and returns `-1`.

This pattern of computation is sometimes called a eureka traversal because as soon as we find what we are looking for, we can cry Eureka! and stop looking.

7.9. Looping and counting¶

The following program counts the number of times the letter `a` appears in a string, and is another example of the counter pattern introduced in Counting digits:

```fruit = "banana"
count = 0
for char in fruit:
if char == 'a':
count += 1
print(count)
```

7.10. Optional parameters¶

To find the locations of the second or third occurence of a character in a string, we can modify the `find` function, adding a third parameter for the starting postion in the search string:

```def find2(strng, ch, start):
index = start
while index < len(strng):
if strng[index] == ch:
return index
index += 1
return -1
```

The call `find2('banana', 'a', 2)` now returns `3`, the index of the first occurance of ‘a’ in ‘banana’ after index 2. What does `find2('banana', 'n', 3)` return? If you said, 4, there is a good chance you understand how `find2` works.

Better still, we can combine `find` and `find2` using an optional parameter:

```def find(strng, ch, start=0):
index = start
while index < len(strng):
if strng[index] == ch:
return index
index += 1
return -1
```

The call `find('banana', 'a', 2)` to this version of `find` behaves just like `find2`, while in the call `find('banana', 'a')`, `start` will be set to the default value of `0`.

If we add another optional parameter to `find` we can use it to search either forward or backward:

```def find(strng, ch, start=0, step=1):
index = start
while 0 <= index < len(strng):
if strng[index] == ch:
return index
index += step
return -1
```

Passing in a value of `len(strng)-1` for start and `-1` for `step` will make `find` search from the end of the string toward the beginning. Note that we needed to check a lower bound for `index` in the while loop as well as an upper bound to accomodate this change.

7.11. Methods on strings¶

The `str` class has a number of useful functions that can manipulate strings. Functions that are part of a class are called methods. (We will learn more about classes and methods in the Classes and objects chapter.)

To see what string methods are available to us, we can use the `dir` function with the class name as an argument which returns the list containing the available methods.

```>>> dir(str)
``['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__',
'__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__iter__',
'__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
'__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize',
'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index',
'isalnum', 'isalpha', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace',
'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind',
'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase',
'title', 'translate', 'upper', 'zfill']``
```

To find out more about it, we can print out its docstring, `__doc__`, which contains documentation on the function:

```>>> print(str.find.__doc__)
S.find(sub[, start[, end]]) -> int

Return the lowest index in S where substring sub is found,
such that sub is contained within S[start:end].  Optional
arguments start and end are interpreted as in slice notation.

Return -1 on failure.
```

Alternatively, you might want to see what happens if you type `help(str.find)`.

The parameters in square brackets are optional parameters. When we use the `str.find` method, we replace the `str` with the string we want to search in:

```>>> fruit = "banana"
>>> index = fruit.find("a")
>>> print(index)
1
```

Notice that `str.find` is more general than our version. it can find substrings, not just characters:

```>>> "banana".find("na")
2
```

Like ours, it takes an additional argument that specifies the index at which it should start:

```>>> "banana".find("na", 3)
4
```

Unlike ours, its second optional parameter specifies the index at which the search should end:

```>>> "bob".find("b", 1, 2)
-1
```

In this example, the search fails because the letter b does not appear in the index range from `1` to `2` (not including `2`).

7.12. Character classification¶

It is often helpful to examine a character and test whether it is upper- or lowercase, or whether it is a character or a digit. Let’s examine a few different ways of doing this.

If we had a character `ch` and we wanted to know if it was a lowercase letter, we could look to see if `ch` was found in a string of all the lowercase letters. If `str.find` returns a value other than `-1`, then `ch` must be lowercase:

```def is_lower(ch):
return "abcdefghijklmnopqrstuvwxyz".find(ch) != -1
```

Alternatively, we can take advantage of the `in` operator:

```def is_lower(ch):
return ch in "abcdefghijklmnopqrstuvwxyz"
```

As yet another alternative, we can use the comparison operator:

```def is_lower(ch):
return 'a' <= ch <= 'z'
```

If `ch` is between a and z, it must be a lowercase letter.

Lastly, we use could use the `islower` method of the string class:

```def is_lower(ch):
return ch.islower()
```

Similarly, there is a `str.isupper()` method that returns True if the string is uppercase.

Other methods in the `str` class may surprise you when you look at their documentation:

```>>> print(str.isspace.__doc__)
S.isspace() -> bool

Return True if all characters in S are whitespace
and there is at least one character in S, False otherwise.
```

Whitespace characters move the cursor without appearing to print anything. They create the white space between visible characters (at least on white paper). Examples of whitespace characters include space `"  "`, tab (`"\t"`), and newline (`"\n"`).

There are other useful functions in the string class, but this book isn’t intended to be a reference manual. On the other hand, the Python Library Reference is. Along with a wealth of other documentation, it’s available from the Python website, http://www.python.org.

7.13. String formatting¶

The most concise and powerful way to format a string in Python is to use the string formatting operator, `%`, together with Python’s string formatting operations. To see how this works, let’s start with a few examples:

```>>> "His name is %s." % ("Arthur")
'His name is Arthur.'
>>> name = "Alice"
>>> age = 10
>>> "I am %s and I am %d years old." % (name, age)
'I am Alice and I am 10 years old.'
>>> n1 = 4
>>> n2 = 5
>>> "2**10 = %d and %d * %d = %f" % (2**10, n1, n2, n1 * n2)
'2**10 = 1024 and 4 * 5 = 20.000000'
>>>
```

The syntax for the string formatting operation looks like this:

```"<FORMAT>" % (<VALUES>)
```

It begins with a format string which contains a sequence of characters and conversion specifications. Conversion specifications always start with a `%`. In the previous examples we saw three conversion specifications: `%s`, `%d`, and `%f`. Following the format string is a single `%` and then a sequence of values, one per conversion specification, separated by commas and enclosed in parentheses. The parentheses are optional if there is only a single value, but it is good practice to always include them. The conversion specifications indicate where in the formatted string the values should be placed and in some cases how the values should be converted to strings.

In the first example above, there is a single conversion specification, `%s`, which indicates a string. The single value, `"Arthur"`, maps to it.

In the second example, `name` has string value, `"Alice"`, and `age` has integer value, `10`. These map to the two converstion specifications, `%s` and `%d`. The `d` in the second converstion specification indicates that the value is a decimal (base 10) integer.

In the third example, variables `n1` and `n2` have integer values `4` and `5`, respectively. There are four conversion specifications in the format string: three `%d`’s and a `%f`. The `f` indicates that the value should be represented as a floating point number. The four values that map to the four converstion specifications are: `2**10`, `n1`, `n2`, and `n1 * n2`.

`s`, `d`, and `f` are all the conversion types we will need for this book. To see a complete list, see the String Formatting Operations section of the Python Library Reference.

We can also use padding to specify the minimum number of characters a value should occupy when it is formatted. If the formatted value is too short extra blank space characters will be added. For example:

```>>> "%6s" % "hi"
'    hi'
>>> "%-6s" % "hi"
'hi    '
>>> "%6s" % "hi there, pythonista!"
'hi there, pythonista!'
```

The numbers in the conversion spefications indicate the minimum size of the resulting string. The `-` in the second example tells the formatter to put any necessary padding to the right. This is also called left-justification because the value ends up on the left side of the formatted string. The final example shows that these numbers really are specifying a minimum width; the string we supplied was longer than six characters, but python still prints the entire string, not just the first six characters, `"hi the"`.

Padding is useful if we want to display data in neatly aligned columns. Without string formatting we might try to do something like this:

```i = 1
print("i\ti**2\ti**3\ti**5\ti**10\ti**20")
while i <= 10:
print("%d\t%d\t%d\t%d\t%d\t%d" % (i, i**2, i**3, i**5, i**10, i**20))
i += 1
```

This program prints out a table of various powers of the numbers from 1 to 10. In its current form it relies on the tab character ( `\t`) to align the columns of values, but this breaks down when the values in the table get larger than the 8 character tab width:

```i       i**2    i**3    i**5    i**10   i**20
1       1       1       1       1       1
2       4       8       32      1024    1048576
3       9       27      243     59049   3486784401
4       16      64      1024    1048576         1099511627776
5       25      125     3125    9765625         95367431640625
6       36      216     7776    60466176        3656158440062976
7       49      343     16807   282475249       79792266297612001
8       64      512     32768   1073741824      1152921504606846976
9       81      729     59049   3486784401      12157665459056928801
10      100     1000    100000  10000000000     100000000000000000000
```

One possible solution would be to change the tab width, but the first column already has more space than it needs. The best solution would be to set the width of each column independently. We can use padding to accomplish this:

```i = 1
print("%-4s%-5s%-6s%-8s%-13s%s" % ('i', 'i**2', 'i**3', 'i**5', 'i**10', 'i**20'))
while i <= 10:
print("%-4s%-5s%-6s%-8s%-13s%s" % (i, i**2, i**3, i**5, i**10, i**20))
i += 1
```

Running this code produces the following output:

```i   i**2 i**3  i**5    i**10        i**20
1   1    1     1       1            1
2   4    8     32      1024         1048576
3   9    27    243     59049        3486784401
4   16   64    1024    1048576      1099511627776
5   25   125   3125    9765625      95367431640625
6   36   216   7776    60466176     3656158440062976
7   49   343   16807   282475249    79792266297612001
8   64   512   32768   1073741824   1152921504606846976
9   81   729   59049   3486784401   12157665459056928801
10  100  1000  100000  10000000000  100000000000000000000
```

Here we have specified the width of each column by choosing padding values big enough to accommodate all the numbers in that column. Notice that the specifier for the last column is just `%s`. Can you explain why that is?

Finally, when formatting floats with `%f`, you can specify the precsion — i.e. the number of decimal places:

```>>> "%.3f" % 123.456789
'123.457'
>>> "%.3f" % 123.4
'123.400'
```

As you can see from the first example, python will round the float appropriately. You can also combine all three of these options, specifying the padding, the justification, and the precision all at once:

```>>> "My number is %-10.3f" % 123.456789
'My number is 123.457   '
```

7.14. Summary and First Exercises¶

This chapter introduced a lot of new ideas. The following summary and set of exercises may prove helpful in remembering what you learned:

indexing (`[]`)

Access a single character in a string using its position (starting from 0). Example: `'This'[2]` evaluates to `'i'`.

length function (`len`)

Returns the number of characters in a string. Example: `len('happy')` evaluates to `5`.

for loop traversal (`for`)

Traversing a string means accessing each character in the string, one at a time. For example, the following for loop:

```for letter in 'Example':
print(2 * letter)
```

will print each letter of the string doubled (e.g. `EE`), each on its own line.

slicing (`[:]`)

A slice is a substring of a string. Example: ```'bananas and cream'[3:6]``` evaluates to `ana` (so does ```'bananas and cream'[1:4]```).

string comparison (`>, <, >=, <=, ==`)

The comparision operators work with strings, evaluating according to lexigraphical order. Examples: `'apple' < 'banana'` evaluates to `True`. `'Zeta' < 'Appricot'` evaluates to `False` . `'Zebra' <= 'aardvark'` evaluates to `True` because all upper case letters precede lower case letters.

in operator (`in`)

The `in` operator tests whether one character or string is contained inside another string. Examples: ```'heck' in "I'll be checking for you."``` evaluates to `True`. ```'cheese' in "I'll be checking for you."``` evaluates to `False`.

7.14.1. First Exercises¶

1. Write the Python interpreter’s evaluation to each of the following expressions:

```>>> 'Python'[1]
```
```>>> "Strings are sequences of characters."[5]
```
```>>> len("wonderful")
```
```>>> 'Mystery'[:4]
```
```>>> 'p' in 'Pinapple'
```
```>>> 'apple' in 'Pinapple'
```
```>>> 'pear' in 'Pinapple'
```
```>>> 'apple' > 'pinapple'
```
```>>> 'pinapple' < 'Peach'
```
2. Write Python code to make each of the following doctests pass:

```"""
>>> type(fruit)
<type 'str'>
>>> len(fruit)
8
>>> fruit[:3]
'ram'
"""
```
```"""
>>> group = "John, Paul, George, and Ringo"
>>> group[12:x]
'George'
>>> group[n:m]
'Paul'
>>> group[:r]
'John'
>>> group[s:]
'Ringo'
"""
```
```"""
>>> len(s)
8
>>> s[4:6] == 'on'
True
"""
```

7.15. Glossary¶

compound data type

A data type in which the values are made up of components, or elements, that are themselves values.

default value

The value given to an optional parameter if no argument for it is provided in the function call.

docstring

A string constant on the first line of a function or module definition (and as we will see later, in class and method definitions as well). Docstrings provide a convenient way to associate documentation with code. Docstrings are also used by the `doctest` module for automated testing.

dot notation

Use of the dot operator, `.`, to access functions inside a module.

immutable

A compound data type whose elements cannot be assigned new values.

index

A variable or value used to select a member of an ordered set, such as a character from a string.

optional parameter

A parameter written in a function header with an assignment to a default value which it will receive if no corresponding argument is given for it in the function call.

slice

A part of a string (substring) specified by a range of indices. More generally, a subsequence of any sequence type in Python can be created using the slice operator (`sequence[start:stop]`).

traverse

To iterate through the elements of a set, performing a similar operation on each.

whitespace

Any of the characters that move the cursor without printing visible characters. The constant `string.whitespace` contains all the white-space characters.

7.16. Exercises¶

1. Modify:

```prefixes = "JKLMNOPQ"
suffix = "ack"

for letter in prefixes:
print(letter + suffix)
```

so that `Ouack` and `Quack` are spelled correctly.

2. Encapsulate

```fruit = "banana"
count = 0
for char in fruit:
if char == 'a':
count += 1
print(count)
```

in a function named `count_letters`, and generalize it so that it accepts the string and the letter as arguments.

3. Now rewrite the `count_letters` function so that instead of traversing the string, it repeatedly calls `find` (the version from Optional parameters), with the optional third parameter to locate new occurences of the letter being counted.

4. Which version of `is_lower` do you think will be fastest? Can you think of other reasons besides speed to prefer one version or the other?

5. Create a file named `stringtools.py` and put the following in it:

```def reverse(s):
"""
>>> reverse('happy')
'yppah'
>>> reverse('Python')
'nohtyP'
>>> reverse("")
''
>>> reverse("P")
'P'
"""

if __name__ == '__main__':
import doctest
doctest.testmod()
```

Add a function body to `reverse` to make the doctests pass.

6. Add `mirror` to `stringtools.py` .

```def mirror(s):
"""
>>> mirror("good")
'gooddoog'
>>> mirror("yes")
'yessey'
>>> mirror('Python')
'PythonnohtyP'
>>> mirror("")
''
>>> mirror("a")
'aa'
"""
```

Write a function body for it that will make it work as indicated by the doctests.

7. Include `remove_letter` in `stringtools.py` .

```def remove_letter(letter, strng):
"""
>>> remove_letter('a', 'apple')
'pple'
>>> remove_letter('a', 'banana')
'bnn'
>>> remove_letter('z', 'banana')
'banana'
>>> remove_letter('i', 'Mississippi')
'Msssspp'
"""
```

Write a function body for it that will make it work as indicated by the doctests.

8. Finally, add bodies to each of the following functions, one at a time

```def is_palindrome(s):
"""
>>> is_palindrome('abba')
True
>>> is_palindrome('abab')
False
>>> is_palindrome('tenet')
True
>>> is_palindrome('banana')
False
>>> is_palindrome('straw warts')
True
"""
```
```def count(sub, s):
"""
>>> count('is', 'Mississippi')
2
>>> count('an', 'banana')
2
>>> count('ana', 'banana')
2
>>> count('nana', 'banana')
1
>>> count('nanan', 'banana')
0
"""
```
```def remove(sub, s):
"""
>>> remove('an', 'banana')
'bana'
>>> remove('cyc', 'bicycle')
'bile'
>>> remove('iss', 'Mississippi')
'Missippi'
>>> remove('egg', 'bicycle')
'bicycle'
"""
```
```def remove_all(sub, s):
"""
>>> remove_all('an', 'banana')
'ba'
>>> remove_all('cyc', 'bicycle')
'bile'
>>> remove_all('iss', 'Mississippi')
'Mippi'
>>> remove_all('eggs', 'bicycle')
'bicycle'
"""
```

until all the doctests pass.

9. Try each of the following formatted string operations in a Python shell and record the results:

1. `"%s %d %f" % (5, 5, 5)`

2. `"%-.2f" % (3)`

3. `"%-10.2f%-10.2f" % (7, 1.0/2)`

4. `print(" \$%5.2f\n \$%5.2f\n \$%5.2f" % (3, 4.5, 11.2))`

10. The following formatted strings have errors. Fix them:

1. `"%s %s %s %s" % ('this', 'that', 'something')`

2. `"%s %s %s" % ('yes', 'no', 'up', 'down')`

3. `"%d %f %f" % (3, 3, 'three')`