Class 3: Folders and Files

Computing for Molecular Biology 1

Andrés Aravena

19 October 2020

Structure of secondary memory


The disks store a huge amount of data

To organize it we use files

To organize the files we use folders
also called directories

Folders, also called Directories

You probably know about computer folders

They are an example of a hierarchical structure

Key ideas:

  • you work in the current directory
  • you can change your current directory
  • always be aware of what your current directory is
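These ideas can be tried from a program. The following is a minimal sketch in Python, assuming nothing more than the standard `os` and `tempfile` modules; the temporary folder exists only for this illustration.

```python
import os
import tempfile

# Remember where we start, so we can come back later.
start_dir = os.getcwd()
print("Current directory:", start_dir)

# Create a temporary folder (for illustration only) and move into it.
with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)
    inside_dir = os.getcwd()
    print("Current directory is now:", inside_dir)
    # Go back before the temporary folder is deleted.
    os.chdir(start_dir)

final_dir = os.getcwd()
print("Back in:", final_dir)
```

Printing the current directory before reading or writing files is a simple habit that avoids many "file not found" surprises.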

Files

Like the main memory, a file is just a list of bytes

The meaning of the file depends on the context

Usually, the name of the file suggests a context

For example, an MP3 file is probably audio
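To see that a file is just a list of bytes, here is a small Python sketch. The file name `example.dat` and the three byte values are made up for the illustration; the same bytes are then read both as numbers and as text.

```python
import os
import tempfile

# Three bytes, given as numbers between 0 and 255.
data = bytes([72, 105, 33])

# Write them to a file in a temporary folder (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "example.dat")
with open(path, "wb") as f:
    f.write(data)

# Read the file back as raw bytes.
with open(path, "rb") as f:
    content = f.read()

as_numbers = list(content)          # the bytes seen as numbers
as_text = content.decode("ascii")   # the same bytes seen as ASCII text
print(as_numbers, as_text)
```

The bytes do not change; only the context in which we interpret them does.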

File attributes

Besides the data itself, files have metadata

That is, data about the data. For example:

  • Files have a name
  • Files have a modification date, maybe other dates too
  • Files have a size

You should learn how to read them
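File attributes can also be read from a program. This is a sketch in Python using `os.stat` from the standard library; the file `notes.txt` is created only for the example.

```python
import os
import tempfile
from datetime import datetime

# Create a small file to inspect (hypothetical name and content).
path = os.path.join(tempfile.mkdtemp(), "notes.txt")
with open(path, "w", encoding="ascii") as f:
    f.write("hello")

info = os.stat(path)                              # ask the system for the metadata
name = os.path.basename(path)                     # file name, without the folders
size = info.st_size                               # size in bytes
modified = datetime.fromtimestamp(info.st_mtime)  # last modification date

print(name, size, "bytes, modified", modified)
```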

File names

The names of the files are “words”:

  • a series of letters, numbers and some symbols

  • Technically, a filename is a string or list of characters

On most systems the maximum length of a filename is 255 characters

File names

You can use

  • English letters (A-Z, a-z),
  • numbers (0-9),
  • some symbols: ., - and _

Avoid these symbols, because some systems forbid them or give them a special meaning:
/, :, +, |, <, *, >, " and '

You can use spaces and non-English letters (like ǧ or ñ),
but I recommend not using them, because they may cause problems
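The recommendation above can be turned into a small check. This is only a sketch: the function name `is_safe_name` is hypothetical, and the rule it applies (English letters, digits, `.`, `-` and `_`) is the conservative advice of these slides, not a rule imposed by any operating system.

```python
import re

# Only English letters, digits, dot, hyphen and underscore.
SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")

def is_safe_name(filename):
    """Return True if the filename uses only the recommended characters."""
    return SAFE_NAME.fullmatch(filename) is not None

for candidate in ["homework_1.txt", "my homework.txt", "señal.txt"]:
    print(candidate, is_safe_name(candidate))
```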

File names

On some systems lowercase and UPPERCASE letters are not equivalent

For example HOMEWORK.txt and homework.txt are different

Be careful. Be systematic and consistent:

  • Always use the same name
  • Easy way: use only lowercase
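A short Python sketch can find names that differ only in upper and lower case. The function `case_clashes` is a hypothetical helper written for this example.

```python
def case_clashes(filenames):
    """Group the names that become identical when written in lowercase."""
    groups = {}
    for name in filenames:
        groups.setdefault(name.lower(), []).append(name)
    # Keep only the groups with more than one spelling.
    return [names for names in groups.values() if len(names) > 1]

clashes = case_clashes(["HOMEWORK.txt", "homework.txt", "data.csv"])
print(clashes)
```

If every name is already lowercase, the function returns an empty list.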

File extensions

If the filename includes a dot (.), the text after the last dot is called the extension

In Microsoft Windows®, extensions are usually 3 letters long

For example

  • EXE
  • JPG
  • DOC
  • XLS
  • TXT
  • CSV
  • MP4

File extensions

The extension is a hint about how to interpret the file

  • Images: JPG, PNG, GIF, TIFF
  • Movies: AVI, MP4, MOV
  • Audio: WAV, MP3
  • Documents: DOC, DOCX, PDF
  • Genomic data: GBK, FNA, FAA
  • Programs: EXE, APP
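As a sketch of how a program can use the extension as a hint, the Python code below extracts it with `pathlib` and looks it up in a small table. The table `KINDS` and the function `guess_kind` are hypothetical and cover only some of the extensions listed above.

```python
from pathlib import PurePath

# A small, incomplete table taken from the list above.
KINDS = {
    ".jpg": "image", ".png": "image", ".gif": "image",
    ".avi": "movie", ".mp4": "movie", ".mov": "movie",
    ".wav": "audio", ".mp3": "audio",
    ".doc": "document", ".docx": "document", ".pdf": "document",
    ".gbk": "genomic data", ".fna": "genomic data", ".faa": "genomic data",
}

def guess_kind(filename):
    """Guess the kind of file from the text after the last dot."""
    extension = PurePath(filename).suffix.lower()  # for example ".jpg"
    return KINDS.get(extension, "unknown")

for name in ["photo.JPG", "genome.gbk", "README"]:
    print(name, "->", guess_kind(name))
```

The guess can be wrong: renaming a file changes its extension but not its content.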

Kinds of file

It is useful to separate computer files into two groups:

Text files
each byte is a character, so people can read them
Binary files
bytes are grouped into binary numbers that represent images, sounds, etc.
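One rough way to tell the two groups apart is to check whether every byte is a printable character. The Python sketch below is only a heuristic under that assumption, and `looks_like_ascii_text` is a hypothetical name; real tools use more elaborate rules.

```python
def looks_like_ascii_text(data):
    """Heuristic: True if every byte is printable ASCII, a tab or a line ending."""
    allowed = set(range(32, 127)) | {9, 10, 13}  # printable, tab, LF, CR
    return all(byte in allowed for byte in data)

text_example = b"ATGCATGC\n"
binary_example = bytes([0x89, 0x50, 0x4E, 0x47])  # the first bytes of a PNG image

print(looks_like_ascii_text(text_example))
print(looks_like_ascii_text(binary_example))
```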

Content of a binary file

[Figure: content of a binary file]

Content of a text file

[Figure: content of a text file]

Binary files are only for computers

  • It is very hard to understand a binary file without a computer

  • Often it can only be read by the program that made it

  • Most of these programs are proprietary

  • If the company goes out of business, you may lose your data

  • New versions of the program may not read the old files

Text files are for humans and computers

  • Binary files are hard to read
    • unless you have the correct program
  • Text files can be read by humans
    • Each byte is a letter
  • Text files can be read by computers
    • Data must be reusable
    • The output of one program may be the input of another program
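The idea that one program's output can be another program's input can be sketched in Python with a CSV text file. The file name `genes.csv` and the values in it are made up for the illustration.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "genes.csv")

# "Program 1" writes a small table as plain text (made-up values).
with open(path, "w", newline="", encoding="ascii") as f:
    writer = csv.writer(f)
    writer.writerow(["gene", "length"])
    writer.writerow(["geneA", 1200])
    writer.writerow(["geneB", 950])

# "Program 2" reads the same file, without knowing who wrote it.
with open(path, newline="", encoding="ascii") as f:
    rows = list(csv.DictReader(f))

total_length = sum(int(row["length"]) for row in rows)
print(total_length)
```

Because the file is plain text, it can also be opened and checked by a person with any text editor.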

Text files are forever

Free

  • nothing to pay
  • you can do whatever you want

They never become obsolete

Word® files are not text files

(doc or docx)

You shall not use Microsoft Word® to handle data

Text Files

  • are universal
  • are easy to read and write from a program
  • do not have any style like bold or italic
  • are like books without figures

The natural way to represent a text document is to encode each letter with a single byte

There is a basic standard for English, called ASCII

ASCII code

Each letter is coded by a number

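Add the row label to the column label to get the code. For example, A is 60 + 5 = 65.

      30   40   50   60   70   80   90   100  110  120
+0         (    2    <    F    P    Z    d    n    x
+1         )    3    =    G    Q    [    e    o    y
+2         *    4    >    H    R    \    f    p    z
+3    !    +    5    ?    I    S    ]    g    q    {
+4    "    ,    6    @    J    T    ^    h    r    |
+5    #    -    7    A    K    U    _    i    s    }
+6    $    .    8    B    L    V    `    j    t    ~
+7    %    /    9    C    M    W    a    k    u
+8    &    0    :    D    N    X    b    l    v
+9    '    1    ;    E    O    Y    c    m    w

Code 32 is the space. The other empty cells are special signals or are not part of ASCII.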
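The codes can be checked in Python with the built-in functions `ord` and `chr`. This is a small sketch; Python actually works with Unicode, which agrees with ASCII for the codes 0 to 127.

```python
# ord() gives the code of a character; chr() gives the character of a code.
code_of_A = ord("A")
letter_97 = chr(97)
print(code_of_A, letter_97)

# A word is a list of codes.
word = "DNA"
codes = [ord(letter) for letter in word]
print(codes)

# The codes can be turned back into the same word.
back = "".join(chr(code) for code in codes)
print(back)
```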

ASCII code

Each number from 0 to 127 is either a symbol or a special signal

  • New Line
  • End of Message
  • Tab
  • Space
  • Backspace

Encodings for non-English languages extend ASCII and use the numbers between 128 and 255 for symbols like “Ç”, “Ö”, “É”, “Ñ”
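The following Python sketch shows the codes of some special signals and of the four non-English letters above. It assumes the Latin-1 (ISO 8859-1) encoding, which is only one of several single-byte encodings; in other encodings the same letters can have different numbers.

```python
# Codes of some special signals (all below 128, so they are part of ASCII).
special = {
    "new line": ord("\n"),
    "tab": ord("\t"),
    "space": ord(" "),
    "backspace": ord("\b"),
}
print(special)

# Codes of non-English letters, ASSUMING the Latin-1 encoding.
latin1_codes = {letter: letter.encode("latin-1")[0] for letter in "ÇÖÉÑ"}
print(latin1_codes)
```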

To learn more

  • https://en.wikipedia.org/wiki/Binary_file
  • https://en.wikipedia.org/wiki/Text_file
  • https://en.wikipedia.org/wiki/Directory_(computing)
  • https://en.wikipedia.org/wiki/List_of_file_formats

Questions?

Homework: Install R 4.0 on your computer

For this course we will use the new version of R and RStudio. These two tools work together. Install R first, then install RStudio.

These videos may help you

Now

Fill in the survey at the course homepage

Visit dry-lab.org/blog/2020/cmb1 and fill in the survey