November 21, 2019

How to copy without copying

Efficient usage of the disk

For today’s class we need to “copy” several files into each of your folders

But we will not modify the files. They will be read-only

And we do not want 30 copies of the same file. That would use too much disk space

Copy without copy

Instead of copying, we will link the files to your folder

mkdir Gutenberg
cd Gutenberg
ln /home/andres/Gutenberg/* .
ls -l

Result should be like this

total 6316
-rw-r--r-- 1 andres andres  594933 Nov 14 12:46 Adventures_of_Sherlock_Holmes.txt
-rw-r--r-- 1 andres andres 2347825 Nov 14 12:46 Don_Quixote.txt
-rw-r--r-- 1 andres andres  405900 Nov 14 12:46 Dubliners.txt
-rw-r--r-- 1 andres andres  748536 Nov 14 12:46 Educating_by_Story-Telling.txt
-rw-rw-r-- 1 andres andres 1938980 Nov 15 09:52 english
-rw-r--r-- 1 andres andres  141420 Nov 14 12:46 Metamorphosis.txt
-rw-r--r-- 1 andres andres  272277 Nov 14 12:46 Study_In_Scarlet.txt

ln gives new names to existing files

Each physical file on the disk can have several names

ln creates a new name for an existing file

The second column in the output of ls -l shows how many names a file has

What happens when we use rm?
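One way to explore this question is a small experiment in a scratch directory. This is only a sketch: the names original.txt and second_name.txt are made up for the demo, and it assumes mktemp and awk are available on your system.

```shell
# Scratch directory made up for this demo; the class files are not touched
demo=$(mktemp -d)
printf 'hello\n' > "$demo/original.txt"

# Give the same data a second name
ln "$demo/original.txt" "$demo/second_name.txt"
links_before=$(ls -l "$demo/original.txt" | awk '{print $2}')

# rm removes one name; the data stays while another name still exists
rm "$demo/original.txt"
links_after=$(ls -l "$demo/second_name.txt" | awk '{print $2}')
content_after=$(cat "$demo/second_name.txt")

echo "names before rm: $links_before"
echo "names after rm:  $links_after"
echo "content:         $content_after"
rm -r "$demo"
```

The data is only released when its last name is removed.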

Searching for a pattern

We use grep to look for a pattern in one or more files

We use these options:

grep --color 'regex' file ...
grep --only 'regex' file ...
grep --count 'regex' file ...

The pattern is a regex. This means Regular Expression

A regex is a single piece of text that describes a whole set of strings
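A detail worth checking before counting anything: --count reports matching lines, not individual matches. The sketch below uses a made-up three-line sample file. -c and -o are the short forms of --count and --only-matching.

```shell
# Sample text invented for this demo
sample=$(mktemp)
printf '%s\n' 'Holmes met Holmes' 'Watson waited' 'Holmes left' > "$sample"

# -c counts the lines that contain at least one match
matching_lines=$(grep -c 'Holmes' "$sample")

# -o prints every match on its own line, so wc -l counts all matches
all_matches=$(grep -o 'Holmes' "$sample" | wc -l | tr -d ' ')

echo "lines: $matching_lines, matches: $all_matches"
rm "$sample"
```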

One word can match several lines

grep --color 'analyze' english
analyze
analyzed
analyzer
analyzer's
analyzers
analyzes
psychoanalyze
psychoanalyzed
psychoanalyzes

Only lines starting with “analyze”

grep --color '^analyze' english
analyze
analyzed
analyzer
analyzer's
analyzers
analyzes

The symbol ^ represents “start of line”

Only lines ending with “analyze”

grep --color 'analyze$' english
analyze
psychoanalyze

The symbol $ represents “end of line”

Only “analyze”

grep --color '^analyze$' english
analyze

Symbols ^ and $ are called “anchors”
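The four searches above can be compared side by side on a tiny word list. This is a sketch with a made-up four-word file, so it runs without the english file.

```shell
# Word list invented for this demo
words=$(mktemp)
printf '%s\n' analyze analyzed psychoanalyze psychoanalyzed > "$words"

anywhere=$(grep -c 'analyze' "$words")     # no anchor
at_start=$(grep -c '^analyze' "$words")    # anchored at start of line
at_end=$(grep -c 'analyze$' "$words")      # anchored at end of line
whole=$(grep -c '^analyze$' "$words")      # anchored at both ends

echo "$anywhere $at_start $at_end $whole"
rm "$words"
```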

Exercise

Count how many times the word “Sherlock” appears in each text file in the Gutenberg folder

  • At the beginning of a line
  • At the end of a line
  • Anywhere

Searching American and British

There are small differences between American and British versions of the English language

grep --color '^analyze$' english
analyze
grep --color '^analyse$' english
analyse

Looking for both at the same time

grep --color '^analy[sz]e$' english
analyse
analyze

The symbols [ and ] indicate a character class

That is, one letter from a set of letters
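A class stands for exactly one character, so the rest of the word still has to match. A small sketch with a made-up word list:

```shell
# Word list invented for this demo
words=$(mktemp)
printf '%s\n' analyse analyze analyte analyzed > "$words"

# [sz] accepts s or z in that one position; t is rejected,
# and the $ anchor rejects the longer word "analyzed"
both=$(grep -c '^analy[sz]e$' "$words")

echo "$both"
rm "$words"
```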

Character classes

A character class allows you to match a range or set of characters

Example: [aeiou] will match any (English) vowel

This matches “c”, followed by a vowel, followed by “t”

grep --only 'c[aeiou]t' english | sort | uniq
cat
cet
cit
cot
cut

Negated Character Classes

We can also use character classes to specify characters we don’t want to match. These are called negated character classes

They are created by putting a caret ^ at the beginning of the class

This will match a “c”, followed by a non-vowel, followed by a “t”:

grep --only 'c[^aeiou]t' english | sort | uniq
cht
ckt
cst
cyt
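The two classes can be compared on the same input. This sketch uses four made-up words instead of the english file. -o is the short form of --only-matching.

```shell
# Word list invented for this demo
words=$(mktemp)
printf '%s\n' cat yacht ecstasy cut > "$words"

# c, one vowel, t
vowel_hits=$(grep -o 'c[aeiou]t' "$words" | sort | uniq)

# c, one character that is not a vowel, t
other_hits=$(grep -o 'c[^aeiou]t' "$words" | sort | uniq)

echo "$vowel_hits"
echo "$other_hits"
rm "$words"
```

Note that a negated class accepts any character outside the set, not only consonants. Digits, spaces, and punctuation match too.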

Exercise

Show the complete matching line with the pattern in color for the following regex

  • ‘c[aeiou]t’
  • ‘c[^aeiou]t’

Ranges

You can also match a range of characters using a character class. For example,

[a-i]

will match any of the letters between a and i (inclusive)

Character classes work with numbers too

This matches any line containing four consecutive digits that start with a nonzero digit, for example a year between 1000 and 9999:

grep '[1-9][0-9][0-9][0-9]' *txt
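The pattern also matches inside longer numbers. The sketch below shows this on made-up lines and uses the -w option (match whole words only) as one possible way to exclude them.

```shell
# Sample lines invented for this demo
sample=$(mktemp)
printf '%s\n' 'In 1887 Holmes' 'room 221B' 'code 123456' 'year 0999' > "$sample"

# Matches 1887, and also the 1234 inside 123456
loose=$(grep -c '[1-9][0-9][0-9][0-9]' "$sample")

# -w requires the match to be a whole word, so only 1887 remains
strict=$(grep -c -w '[1-9][0-9][0-9][0-9]' "$sample")

echo "loose: $loose, strict: $strict"
rm "$sample"
```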

Any character

The symbol . represents any character

grep --only 'c.t' english | sort | uniq
cat
cet
cht
cit
ckt
cot
cst
cut
cyt
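The dot stands for exactly one character of any kind, including punctuation. A short sketch on made-up input:

```shell
# Sample lines invented for this demo
sample=$(mktemp)
printf '%s\n' cat c-t ct coat > "$sample"

# "cat" and "c-t" match; "ct" has no character in between,
# and "coat" has two
dot_hits=$(grep -c 'c.t' "$sample")

echo "$dot_hits"
rm "$sample"
```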

Exercise

Find all the lines ending with “Holmes” followed by a single character

Repetitions

The * symbol means that the preceding expression is repeated zero or more times

That is, it follows an expression, which may then be absent or repeated

grep '^colou*r$' english
color
colour
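The english file only shows two results, but * allows any number of repetitions. A sketch with a made-up list that includes a misspelling:

```shell
# Word list invented for this demo; "colouur" and "colr" are not real words
words=$(mktemp)
printf '%s\n' color colour colouur colr > "$words"

# u* accepts zero, one, or many u's; "colr" fails because the second o is missing
star_hits=$(grep -c '^colou*r$' "$words")

echo "$star_hits"
rm "$words"
```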

Escaping

The characters ., *, [, ], ^, $ are special

They are called meta-characters

How can we look for them?

To take out the “superpowers”, we use \

\., \*, \[, \], \^, \$, and \\

The rest of the characters match themselves
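The effect of escaping can be seen with a literal dot. This is a sketch on made-up lines. The -F option, which treats the whole pattern as plain text, is shown as an alternative to escaping.

```shell
# Sample lines invented for this demo
sample=$(mktemp)
printf '%s\n' '3.14' '3514' '3x14' > "$sample"

unescaped=$(grep -c '3.14' "$sample")    # . matches any character
escaped=$(grep -c '3\.14' "$sample")     # \. matches only a real dot
fixed=$(grep -c -F '3.14' "$sample")     # -F: no meta-characters at all

echo "$unescaped $escaped $fixed"
rm "$sample"
```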

Exercise

Look for “Holmes.”

Extended regular expressions

Instead of grep we will use egrep, which is the same as grep -E

Now the characters .?*[]^${}()+|\ are special

As before, we can always escape them

Zero or one time

? is like * but means “zero or one time”

egrep --only 'lo?k' english | sort | uniq
lk
lok
egrep --only 'lo*k' english | sort | uniq
lk
lok
look
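The same comparison can be run on whole lines. This sketch uses grep -E, which should behave like egrep, and a made-up file holding the three strings from the output above.

```shell
# The three strings from the slide, one per line
sample=$(mktemp)
printf '%s\n' lk lok look > "$sample"

zero_or_one=$(grep -E -c '^lo?k$' "$sample")    # lk, lok
zero_or_more=$(grep -E -c '^lo*k$' "$sample")   # lk, lok, look

echo "$zero_or_one $zero_or_more"
rm "$sample"
```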

One or more times

+ means one or more times

That is, [a-z][a-z]* is the same as [a-z]+
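The equivalence can be checked by comparing the output of both patterns. A sketch on made-up lines, assuming grep -E for the extended syntax:

```shell
# Sample lines invented for this demo
sample=$(mktemp)
printf '%s\n' 'abc 123 de' '42' 'x' > "$sample"

with_star=$(grep -o '[a-z][a-z]*' "$sample")   # basic regex
with_plus=$(grep -E -o '[a-z]+' "$sample")     # extended regex

# Both print the same runs of lowercase letters
echo "$with_star"
rm "$sample"
```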

Controlling the number of times

We can use curly braces to repeat something a number of times within a range:

^a{3,5}$

That will match the letter “a” repeated 3, 4, or 5 times.

Controlling the number of times

If you want to match something repeated up to a certain number of times, you can use 0 as the first number.

If you want to match something repeated at least a certain number of times, with no maximum, you can just leave the second number blank:

^a{3,}$
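The three forms of the braces can be compared on lines of one to six a's. This is a sketch with made-up input, using grep -E for the extended syntax.

```shell
# Lines with 1 to 6 repetitions of the letter a
sample=$(mktemp)
printf '%s\n' a aa aaa aaaa aaaaa aaaaaa > "$sample"

between=$(grep -E -c '^a{3,5}$' "$sample")    # 3, 4, or 5 times
at_least=$(grep -E -c '^a{3,}$' "$sample")    # 3 or more times
up_to=$(grep -E -c '^a{0,2}$' "$sample")      # at most 2 times

echo "$between $at_least $up_to"
rm "$sample"
```

An empty line would also match ^a{0,2}$, but the sample has none.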

Alternatives

If you want to match either of two expressions, you can use |

egrep 'cat|dog' english
Alcatraz    Yucatan     adjudicate  advocate
Decatur     abdicate    adjudicated advocated
Hecate      abdicated   adjudicates advocates
Ladoga      abdicates   adjudicating    advocating
Mercator    abdicating  adjudication    allocate
Muscat      abdication  adjudicator allocated
Popocatepetl    abdications adjudicators    allocates

Look for cats and dogs in all the text files
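As the output above shows, cat|dog also matches inside longer words. The sketch below compares that with the -w option (whole words only) on made-up lines. Whether -w is what you want depends on the question being asked.

```shell
# Sample lines invented for this demo
sample=$(mktemp)
printf '%s\n' Alcatraz cat 'hot dog' bird dogma > "$sample"

inside_words=$(grep -E -c 'cat|dog' "$sample")      # also Alcatraz, dogma
whole_words=$(grep -E -c -w 'cat|dog' "$sample")    # only cat, hot dog

echo "$inside_words $whole_words"
rm "$sample"
```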

Grouping

We can use ( and ) to define groups of expressions

egrep --only '([aeiou][^aeiou]){2}' english | sort | uniq
aliy    uran    arer    ured    alit    ole'    amat    edim    erat    uter
aron    urim    arin    ures    aliz    olic    elod    edom    emun    uper
asid    urin    aris    urin    anim    itic    amat    eter    erat    uter
elar    urit    ines    umin    ated    itiv    ical    edom    emun    uper
ilen    anis    aten    umul    ates    ilat    elod    eter    erat    uter
aham    urit    ater    ativ    atin    enat    elon    edom    emun    uper
alom    anis    atif    umul    ator    ened    ane'    eter    erat    ivit
uja'    urit    icat    ativ    idis    enin    anes    ekab    emun    uper
apul    anis    atif    umul    ilat    oses    eme'    eked    erat    ivit
ure'    urus    icat    ifor    eral    osis    emen    ekin    enal    uper
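The group is repeated as a unit, so the pattern asks for vowel, non-vowel, vowel, non-vowel. A sketch on three made-up words, using grep -E in place of egrep:

```shell
# Word list invented for this demo
words=$(mktemp)
printf '%s\n' banana editor tree > "$words"

# {2} applies to the whole group in parentheses
pairs=$(grep -E -o '([aeiou][^aeiou]){2}' "$words" | sort | uniq)

echo "$pairs"
rm "$words"
```

Because [^aeiou] accepts any character that is not a vowel, an apostrophe counts too. That is why results such as ole' appear in the output above.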