Linux for Data Scientists

The first step towards becoming a data scientist is to become familiar with Linux. edX offers a great introductory course by the Linux Foundation, which covers basic to intermediate material.

Important topics include:

  • Linux philosophy and concepts
  • Command line operations (basic operations and working with files)
  • File operations
  • User environment
  • Text editors (vi/vim and emacs)
  • Text manipulation (cat, echo, sed, awk, grep, tr, wc, cut)
  • Bash scripting (a short example follows this list)
  • Security, networks, processes
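
Of these topics, bash scripting is the one that benefits most from a concrete illustration. Below is a minimal sketch of a script; the script name count_lines.sh and the assumption that the current directory holds .txt files are invented for the example:

    #!/bin/bash
    # count_lines.sh (hypothetical): print the line count of
    # every .txt file in the current directory
    for f in *.txt; do
        echo "$f: $(wc -l < "$f") lines"
    done

Run it with bash count_lines.sh, or make it executable with chmod +x count_lines.sh and invoke it as ./count_lines.sh.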

Completing the course gives you a decent command of the command line utilities that are used on an almost daily basis.

Basic commands: man, ls, mkdir, rmdir, rm -rf, file, ln, echo
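
A quick tour of these basics, with directory and file names made up for the example:

    # read the manual page for ls
    man ls

    # create a directory and a small file inside it
    mkdir project
    echo "hello" > project/notes.txt

    # list the directory and identify the file's type
    ls -l project
    file project/notes.txt

    # create a symbolic link to the file
    ln -s project/notes.txt notes-link.txt

    # rmdir removes only empty directories; rm -rf removes a
    # directory and everything in it, so use it with care
    rm notes-link.txt
    rm -rf project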

Working with text files (a combined pipeline example follows this list):

  • cat – concatenate and print the contents of files e.g. cat filename
  • head – print the first 10 lines of a file by default; head -n x filename prints the first x lines
  • tail – print the last 10 lines of a file by default; tail -n x filename prints the last x lines
  • less, more – page through a file one screen at a time instead of dumping it all to the terminal
  • wc – word, line, character, and byte count e.g. wc -l filename
  • grep – search for a pattern in a text file (regular expressions are supported) e.g. grep pattern filename; common options are -i (ignore case), -F (search for a fixed string instead of a pattern), -m n (stop after n matches), -c (print only the count of matching lines), and -C n (print n lines of context before and after each match)
  • tr – translate (replace or delete characters) e.g. tr '0' '1' < filename replaces every 0 with a 1; note that tr reads from standard input rather than taking a filename argument
  • sed – stream editor to transform text e.g. sed 's/apple/orange/g; s/orange/pear/g' filename first replaces all (g for global) occurrences of apple with orange, then replaces all occurrences of orange, including the ones just created, with pear
  • cut – extract a field (column) from a file with a tabular structure (i.e. each line contains a record and each record consists of multiple fields) e.g. cut -d : -f 2 filename extracts the second column using : as the delimiter
  • paste – merge lines of files side by side e.g. paste file1 file2 joins corresponding lines of the two files with a tab
  • split – split a big file into smaller parts e.g. split -l 1000 filename writes 1000-line chunks to files named xaa, xab, and so on
  • sort – sort a file line by line (can also sort by field) e.g. sort filename or sort < filename; sort -k 2 filename sorts by the second field, and sort -c filename only checks whether the file is already sorted
  • uniq – collapse repeated adjacent lines into one, which is why the input is usually sorted first e.g. sort filename | uniq; the -c option also prefixes each line with the number of times it occurs
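
The real power of these utilities comes from chaining them with pipes. The sketch below combines several of them; the file logins.csv and its layout (comma-separated, username in the second field) are invented for the example:

    # hypothetical input, logins.csv:
    #   2017-01-05,alice,success
    #   2017-01-05,bob,failure

    # count the lines recording a failed login
    grep -c "failure" logins.csv

    # extract the username column, count occurrences of each name,
    # and show the five most frequent users
    cut -d , -f 2 logins.csv | sort | uniq -c | sort -rn | head -n 5

    # replace commas with tabs, then rewrite "failure" as "FAILED"
    tr ',' '\t' < logins.csv | sed 's/failure/FAILED/g' > logins.tsv

    # sanity check: both files should have the same line count
    wc -l logins.csv logins.tsv

The sort | uniq -c | sort -rn idiom is a handy way to build a frequency table out of any column.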