FOR
BEGINNERS

Dr. Ajay Kumar Koli, PhD · SARA Institute of Data Science · India

Hello Everyone!!

About SARA




  • SARA stands for Savitribai Ramabai Institute of Data Science.

  • It is a Charitable Education Trust, est. in 2023 by Dr. Ajay Kumar Koli & Dr. Kiran Lata Koli.

  • Our mission is to enhance the representation of under-privileged communities in the field of data science.

Purpose


To share reasons with you (and hopefully convince you) why you should learn data science tools like R, Python, GitHub, Quarto, etc.

Work

We think we work like this but …

Work

This is what our work actually looks like.

🤯 Work Flowchart

MS Word

Excel

PPTs

Other Tools:

  • PDF

  • SPSS

  • SAS

  • STATA

  • Reference management

Work Influencer



👀 Focus on Content.


💅 Set Yourself Apart.

Data Science


“Data science is an exciting discipline that allows you to transform raw data into understanding, insight, and knowledge.”

Career in Data Science


“You don’t need to be an expert programmer to be a successful data scientist.”

Types of Data Roles

Data Science Process



Table of Content



R and
RStudio














R Programming Language




“R is a free software environment for statistical computing and graphics.”

History of R

  • Grew out of the S language, which was developed at Bell Labs.

  • First appeared in August 1993 as R language by:

Ross Ihaka
(New Zealand Statistician)

Robert Gentleman
(Canadian Statistician)

R is FREE

Download R from CRAN

R Console


  • R version 4.3.1 (2023-06-16)

  • R name “Beagle Scouts”

  • R startup notice: “R is free software and comes with ABSOLUTELY NO WARRANTY.”

  • R prompt >|

Workspace Image


  • Don’t save workspace image.

  • It helps in “freshly minted R sessions”.

  • “put more trust in your script than in your memory”.

OPERATORS

Operators

“Operators are used to perform operations on variables and values.”


In the code 12 + 3, the + is an operator.


Tip

Put spaces between and around operators (=+-*/)

R Arithmetic Operators

Arithmetic operators are used with numeric values to perform common mathematical operations:


| Operator | Name           | Example |
|:---------|:---------------|:--------|
| +        | Addition       | x + y   |
| -        | Subtraction    | x - y   |
| *        | Multiplication | x * y   |
| /        | Division       | x / y   |
| ^        | Exponent       | x ^ y   |

R Console


Code

7

Output

[1] 7

R Console: Addition


Code

2 + 1

Output

[1] 3

R Console: Subtraction


Code

10 - 2

Output

[1] 8

R Console: Multiplication


Code

12 * 4

Output

[1] 48

R Console: Division


Code

25 / 5

Output

[1] 5

R Comparison Operators

Comparison operators are used to compare two values:


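| Operator | Name                     | Example |
|:---------|:-------------------------|:--------|
| ==       | Equal                    | x == y  |
| !=       | Not equal                | x != y  |
| >        | Greater than             | x > y   |
| <        | Less than                | x < y   |
| >=       | Greater than or equal to | x >= y  |
| <=       | Less than or equal to    | x <= y  |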
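Comparison operators also work on several values at once. The sketch below is only an illustration: the ages are example data, and the names is_adult and n_adults as well as the cut-off of 18 are assumptions made for this example.

```r
# Example ages (illustrative data)
age <- c(23, 25, 16, 40, 34)

# The comparison is applied to every element and returns TRUE or FALSE for each
is_adult <- age >= 18   # the cut-off 18 is an illustrative assumption
print(is_adult)

# TRUE counts as 1 and FALSE as 0, so sum() counts how many values are TRUE
n_adults <- sum(is_adult)
print(n_adults)
```

Counting TRUE values with sum() is a common way to answer "how many observations meet this condition?".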

R Console: Logic


Code

4 == 5

Output

[1] FALSE

R Console: Logic


Code

67 > 60

Output

[1] TRUE

R Console


Code

3434 + 343453 * 2323 / 534 - 1000

Output

[1] 1496519



Important

R follows the BODMAS (brackets, orders, division, multiplication, addition and subtraction) rule when it evaluates arithmetic expressions.
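A minimal sketch of the order of operations, using small numbers that are easy to check by hand. The object names are illustrative assumptions, not part of the original slides.

```r
# Multiplication is evaluated before addition: 2 + (3 * 4)
no_brackets <- 2 + 3 * 4

# Brackets are evaluated first: (2 + 3) * 4
with_brackets <- (2 + 3) * 4

# Exponents are evaluated before multiplication: 2 * (3 ^ 2)
with_power <- 2 * 3 ^ 2

print(c(no_brackets, with_brackets, with_power))
```

When in doubt, adding brackets makes the intended order explicit and the code easier to read.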

R Console


Code

12:18

Output

[1] 12 13 14 15 16 17 18



Important

R Miscellaneous Operators: Miscellaneous operators are used to manipulate data.

| Operator | Description                                | Example |
|:---------|:-------------------------------------------|:--------|
| :        | Creates a series of numbers in a sequence  | 1:10    |
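A short sketch of what the : operator can do, with seq() shown for comparison when a step other than 1 is needed. The object names are illustrative assumptions.

```r
one_to_ten <- 1:10                        # increasing sequence in steps of 1
count_down <- 5:1                         # a larger start gives a decreasing sequence
evens <- seq(from = 2, to = 10, by = 2)   # seq() allows a different step size
doubled <- 1:3 * 2                        # : is evaluated before *, so this is (1:3) * 2

print(one_to_ten)
print(count_down)
print(evens)
print(doubled)
```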

Plot Using R

plot(1:100)

😏 That’s Okay But How To

  • combine plot, text, tables and images in a single file.

  • publish my work online or convert into a word, pdf or html file.

  • work efficiently with my different projects and save, share and track them.

🔥 WE NEED A SUPERHERO 🔥

S T U D I O

Posit, formerly RStudio


  • RStudio is an integrated development environment (IDE) for R and Python.

  • According to Posit, RStudio is “the most trusted IDE for open source data science”.

  • Download RStudio.

RStudio IDE

“It includes a console, syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging, and workspace management.”

RStudio → Tools → Global Options

R & RStudio


Imagine R as a powerful engine


and RStudio as a stylish car


PROJECT

Open RStudio

RStudio Without Project

RStudio Project Helps:

  • “to divide your work into multiple contexts, each with their own”
    • working directory,
    • workspace,
    • history, and
    • source documents.”

⛷️ Create
RStudio Project
in 4 Steps ⛷️

Create RStudio Project

In Case Anything Goes Wrong...

Create RStudio Project

RStudio Project “name”

RStudio Project “path”

RStudio Project

Artwork by Allison Horst

Write R Code in

R Console

“The code input and output are in the R console”

Write R Code in

R script using RStudio.

Write R Code in

Quarto document using RStudio

R Script (.R)

Write code in the R script → the console shows the results.

  • Benefits of writing code in an R script:
    • You can save it for later use and revision.
    • You can share it with others.
    • You can keep better track of your code.

💡 Tips for R Script

  1. Write readable code, because other people might need to use your code.

  2. Write readable code, because you might need to use your code a few weeks/months/years after you’ve written it.

  3. Put spaces between and around variable names and operators (=+-*/).

  4. Break up long lines of code.

  5. Keep a consistent style.

FUNCTION

Artwork by Allison Horst

R Function

  • “A function, in a programming environment, is a set of instructions.”

  • “A programmer builds a function to avoid repeating the same task, or reduce complexity.”
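To make the idea of "a set of instructions" concrete, here is a minimal sketch of a user-written function. The function name kg_to_g and the example weights are illustrative assumptions, not something defined elsewhere in R.

```r
# kg_to_g() is a hypothetical helper: it converts kilograms to grams
kg_to_g <- function(kg) {
  kg * 1000
}

# The instructions are written once and reused for every value
weight_kg <- c(50, 52, 61, 40, 70)
weight_g <- kg_to_g(weight_kg)
print(weight_g)
```

Writing the conversion once inside a function avoids repeating the same calculation by hand for each value.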

R Function

round(24.3454, 3)


round(argument 1, argument 2)

[1] 24.345

Structure of R Function



Round Function

Function with default argument.

round(46.487)

[1] 46

Round Function

Function with a specific value of an argument.

round(x = 46.587, digits = 2)

[1] 46.59

Square Root Function

Function with a specific value of an argument.

sqrt(x = 9)

[1] 3

Sequence Function

Function with a specific value of an argument.

seq.int(from = 10, to = 30, by = 5)

or

seq.int(from = 10,
        to = 30,
        by = 5)

[1] 10 15 20 25 30

COMMENT

Artwork by Allison Horst

Comment:

  • “Humans will be able to read the comments, but your computer will pass over them.”

  • In R, # is used as a commenting symbol.
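A small sketch of how comments look in practice. The ages are example data and the name mean_age is an illustrative assumption.

```r
# This whole line is a comment, so R passes over it.
age <- c(23, 25, 16, 40, 34)   # text after # on a code line is also ignored

# Comments can explain why a step is done, not only what it does
mean_age <- mean(age)          # average of the five ages
print(mean_age)
```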

PACKAGES

R Packages

“An R package is a collection of functions, data, and documentation that extends the capabilities of base R. Using packages is key to the successful use of R.”

R Packages

Install Packages

Name of the Packages

Installed Packages

Function to Install Packages


install.packages("tidyverse")

Function to Call Package


Using Packages

  • You need to install a package only once, just as:

    • 📚 We buy books once and use them again and again

    • 💡 Fix the bulb once and use it again and again.

Using Packages

  • In every R document, you need to load the package once using the library() function, for example library(ggplot2).

  • Once in a while, you need to update the installed packages as well.

  • Packages live in R's library rather than in RStudio, so if you uninstall R or move to a new version of R, you may need to install them again.
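One possible way to check whether a package is already installed before loading it is sketched below. The helper name is_installed is an assumption made for this example; requireNamespace() and library() are base R functions, and ggplot2 is used only as an example package name.

```r
# is_installed() is a hypothetical helper built on base R's requireNamespace()
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

pkg <- "ggplot2"   # example package name

if (is_installed(pkg)) {
  # Load the package for this session (needed once per R document)
  library(pkg, character.only = TRUE)
} else {
  # Installation is a one-time step, so it is only suggested here, not run
  message("Run install.packages(\"", pkg, "\") once, then load it with library().")
}
```

The sketch deliberately does not install anything by itself, so that installation stays a conscious one-time step.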

Tools → Package Updates

Select Packages to Update

Click Install Updates

To Remove Packages

OBJECTS

Artwork by Allison Horst

R Object


Just a name that you can use to call up stored data.


Important

R assignment operators: Assignment operators are used to assign values to variables.

my_var <- 3

my_var # print my_var

Create Object

age <- c(23, 25, 16, 40, 34)

age

[1] 23 25 16 40 34

Create Object

income <- c(23000, 25000, 16000, 4000, 34000)

income

[1] 23000 25000 16000 4000 34000

Create Object

name <- c("Bhim", "Rama", "Sara", "Phule", "Savitri")

name

[1] "Bhim"    "Rama"    "Sara"    "Phule"   "Savitri"

Create Object

place <- c("MH", "RJ", "DL", "HR", "HR")

place

[1] "MH" "RJ" "DL" "HR" "HR"

Create Object

weight_kg <- c(50, 52, 61, 40, 70)

weight_kg

[1] 50 52 61 40 70

💡Guidelines to Name R Objects:

  • a name cannot start with a number.

  • a name cannot use some special symbols, like ^, !, $, @, +, -, /, *, or :.

  • avoid caps.

  • avoid spaces.

  • use an underscore for object names (like weight_kg); a dash (like weight-kg) is fine in file names but not in object names, because R reads - as subtraction.

  • if chronology matters then add the date to the file name (2020-09-05-file-name).
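A rough way to test whether a name is acceptable to R is sketched below. It relies on base R's make.names(), which rewrites a string into a syntactically valid name; a candidate is treated as valid when make.names() leaves it unchanged. The candidate names are illustrative assumptions.

```r
# Candidate object names to check (illustrative examples)
candidates <- c("weight_kg", "weight-kg", "2weight", "weight kg")

# make.names() repairs invalid names, so an unchanged name was already valid
is_valid <- make.names(candidates) == candidates

print(data.frame(candidates, is_valid))
```

Only the underscore version passes this check; the dash, the leading number and the space are all changed by make.names().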

RStudio Environment Window


🤔 How to combine all these objects and form a data set?

👇 Something Like This 😻😻


     name income age place weight_kg
1    Bhim  23000  23    MH        50
2    Rama  25000  25    RJ        52
3    Sara  16000  16    DL        61
4   Phule   4000  40    HR        40
5 Savitri  34000  34    HR        70

How to Create a Data Object?

example_df <- data.frame(name, income, age, place, weight_kg)

example_df
     name income age place weight_kg
1    Bhim  23000  23    MH        50
2    Rama  25000  25    RJ        52
3    Sara  16000  16    DL        61
4   Phule   4000  40    HR        40
5 Savitri  34000  34    HR        70

Export Data as a csv File

library(readr)

# first create a folder named "data" in your working directory
write_csv(example_df, "data/example_df.csv")

To see the created file, check the “data” folder in your working directory.
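The same export step can be sketched with base R only. This swaps in write.csv() and read.csv() in place of readr's write_csv(), and writes to a temporary folder so nothing in your project is touched. The folder location, the shortened data frame and the name read_back are assumptions made for this example.

```r
# A shortened version of the example data (illustrative)
example_df <- data.frame(
  name = c("Bhim", "Rama", "Sara"),
  income = c(23000, 25000, 16000)
)

# Create a "data" folder inside R's temporary directory (assumed location)
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)

# Write the data frame to a csv file without row numbers
csv_path <- file.path(data_dir, "example_df.csv")
write.csv(example_df, csv_path, row.names = FALSE)

# Read the file back to confirm that the export worked
read_back <- read.csv(csv_path, stringsAsFactors = FALSE)
print(file.exists(csv_path))
print(read_back)
```

Reading the file back is a quick check that the exported file contains what you expect.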

List of All Objects

ls()
[1] "age"             "example_df"      "has_annotations" "income"
[5] "name"            "place"           "weight_kg"

COMMUNITY

Artwork by Allison Horst

Help Using Console >

In the console, type ? followed by your query, for example ?round.

RStudio: Package Website


Posit Community

Stack Overflow

GitHub

Social Media #RStats

Artwork by Allison Horst

PUBLISH
USING
QUARTO


QUARTO

Quarto is the Next Generation of R Markdown




Quarto

“An open-source scientific and technical publishing system”

Quarto can produce a wide variety of output formats:

Articles & Reports

Presentations

Interactive Docs

Websites

Books

Analyze. Share. Reproduce. You have a story to tell with data — tell it with Quarto.

Download Quarto

Get Started: Choose IDE

Create a New Quarto Document

File → New File → Quarto Document

New Quarto Document

Save Quarto Document

Process When You Render the Quarto Document

Quarto sends the .qmd file to knitr, which executes all of the code chunks and creates a new markdown (.md) document which includes the code and its output. The markdown file generated by knitr is then processed by pandoc, which is responsible for creating the finished file.


Source Editor vs. Visual Editor


Visual Editor

Source Editor

MARKDOWN

Text formatting


Markdown Syntax Output
normal
normal
*italics*
italics
**bold**
bold
***bold italics***
bold italics

Text formatting


Markdown Syntax Output
superscript^2^
superscript2
subscript~2~
subscript2
~~strike through~~
strike through
`verbatim code`
verbatim code

Headings


Markdown Syntax Output
# Header 1

Header 1

## Header 2

Header 2

### Header 3

Header 3

#### Header 4

Header 4

##### Header 5
Header 5
###### Header 6
Header 6

Add Images

If the image is saved on your computer,
![](add image path here)


Markdown Syntax Output
![](rose.jpg)

Add Images

If the image is taken from the internet,
![](add image link here)


![Smile everyday](https://images.unsplash.com/photo-1627130595904-ebeeb6540a93?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D)

Unordered List


Markdown Syntax Output
* Item 1
* Item 2
* Item 3
  • Item 1
  • Item 2
  • Item 3

Unordered List: Sub-items


Markdown Syntax Output
* Main items
    + Sub-item 1
    + Sub-item 2
        - Sub-sub-item 1
  • Main items
    • Sub-item 1
    • Sub-item 2
      • Sub-sub-item 1

Ordered List


Markdown Syntax Output
1. Eggs
1. Tea
1. Fish
1. Milk
  1. Eggs
  2. Tea
  3. Fish
  4. Milk

List


Markdown Syntax Output
(@)  A list whose numbering

continues after

(@)  an interruption
  1. A list whose numbering

continues after

  1. an interruption

List


Markdown Syntax Output
::: {}
1. A list
:::

::: {}
1. Followed by another list
:::
  1. A list
  1. Followed by another list

Definition

term
: definition


Markdown Syntax Output
Power
: Power is power.
Power
Power is power.

Equations

Use $ delimiters for inline math.


Markdown Syntax Output
It is a great equation $E = mc^{2}$
It is a great equation \(E=mc^{2}\)

Equations

Use $$ delimiters for display math.


Markdown Syntax Output
It is a great equation $$E = mc^{2}$$
It is a great equation \[E=mc^{2}\]

In-line Coding

`r `


Code

`r 1+1`

Output

2

Videos

You can include videos in documents using the {{< video >}} shortcode.


Code

{{< video https://www.youtube.com/embed/wo9vZccmqwc >}}

Output

Tables


Markdown Syntax

| Right | Left | Default | Center |
|------:|:-----|---------|:------:|
|   12  |  12  |    12   |    12  |
|  123  |  123 |   123   |   123  |
|    1  |    1 |     1   |     1  |

Output

Right Left Default Center
12 12 12 12
123 123 123 123
1 1 1 1

Know Your Data





“Happy families are all alike; every unhappy family is unhappy in its own way.” — Leo Tolstoy

Know Your Data

example_df
     name income age place weight_kg
1    Bhim  23000  23    MH        50
2    Rama  25000  25    RJ        52
3    Sara  16000  16    DL        61
4   Phule   4000  40    HR        40
5 Savitri  34000  34    HR        70

Data: Variables


     name income age place weight_kg
1    Bhim  23000  23    MH        50
2    Rama  25000  25    RJ        52
3    Sara  16000  16    DL        61
4   Phule   4000  40    HR        40
5 Savitri  34000  34    HR        70

Data: Observations


     name income age place weight_kg
1    Bhim  23000  23    MH        50
2    Rama  25000  25    RJ        52
3    Sara  16000  16    DL        61
4   Phule   4000  40    HR        40
5 Savitri  34000  34    HR        70

Data: Values


     name income age place weight_kg
1    Bhim  23000  23    MH        50
2    Rama  25000  25    RJ        52
3    Sara  16000  16    DL        61
4   Phule   4000  40    HR        40
5 Savitri  34000  34    HR        70

Column/Variable/Data Types

  • chr = character, example "A"

  • int = integer, example 1L

  • dbl = double, example 1.5

  • lgl = logical, example TRUE

  • fct = factor, example factor("A")
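The abbreviations above can be checked directly in R. The sketch below uses base R's class() and typeof() on the same example values; the name types is an illustrative assumption.

```r
# Ask R for the type of each example value from the list above
types <- c(
  chr = class("A"),          # character
  int = class(1L),           # integer (the L suffix marks an integer)
  dbl = typeof(1.5),         # double (class() would report "numeric")
  lgl = class(TRUE),         # logical
  fct = class(factor("A"))   # factor
)

print(types)
```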

Numeric Value Types


Data Summary

summary(example_df)
     name               income           age          place          
 Length:5           Min.   : 4000   Min.   :16.0   Length:5          
 Class :character   1st Qu.:16000   1st Qu.:23.0   Class :character  
 Mode  :character   Median :23000   Median :25.0   Mode  :character  
                    Mean   :20400   Mean   :27.6                     
                    3rd Qu.:25000   3rd Qu.:34.0                     
                    Max.   :34000   Max.   :40.0                     
   weight_kg   
 Min.   :40.0  
 1st Qu.:50.0  
 Median :52.0  
 Mean   :54.6  
 3rd Qu.:61.0  
 Max.   :70.0  

Data Summary

str(example_df)
'data.frame':   5 obs. of  5 variables:
 $ name     : chr  "Bhim" "Rama" "Sara" "Phule" ...
 $ income   : num  23000 25000 16000 4000 34000
 $ age      : num  23 25 16 40 34
 $ place    : chr  "MH" "RJ" "DL" "HR" ...
 $ weight_kg: num  50 52 61 40 70

Data Summary

library(dplyr) # glimpse() comes from the dplyr package

glimpse(example_df)
Rows: 5
Columns: 5
$ name      <chr> "Bhim", "Rama", "Sara", "Phule", "Savitri"
$ income    <dbl> 23000, 25000, 16000, 4000, 34000
$ age       <dbl> 23, 25, 16, 40, 34
$ place     <chr> "MH", "RJ", "DL", "HR", "HR"
$ weight_kg <dbl> 50, 52, 61, 40, 70

Data Summary

library(skimr)

skim(example_df)
Data summary
Name example_df
Number of rows 5
Number of columns 5
_______________________
Column type frequency:
character 2
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 4 7 0 5 0
place 0 1 2 2 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
income 0 1 20400.0 11193.75 4000 16000 23000 25000 34000 ▃▃▁▇▃
age 0 1 27.6 9.45 16 23 25 34 40 ▃▇▁▃▃
weight_kg 0 1 54.6 11.39 40 50 52 61 70 ▃▇▁▃▃

palmerpenguins

The R package is named palmerpenguins and its dataset is named penguins.


Data Summary

library(palmerpenguins) # do not forget to load the package

glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

🧠 YOUR TURN



How to summarize the penguins data using the summary(), skim(), & str() functions?

DATA
VISUALIZATION









“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey

Goal

How to visualize data using R package ggplot2.


ggplot2 Layers

Import Data

Figure 1

Map Variables to Aesthetics

ggplot(data = penguins,
       mapping = aes(x = species))
Figure 2

Add Geometric Shapes

ggplot(data = penguins,
       mapping = aes(x = species)) +
  geom_bar()
Figure 3

Key Components are:


  1. Data,

  2. A set of aesthetic mappings between variables in the data and visual properties, and

  3. At least one layer which describes how to render each observation. Layers are usually created with a geom function.

🧠 YOUR TURN

ggplot(data = penguins,
       mapping = aes(x = island)) +
  geom_bar()

Common Mistakes

  • Make sure that every ( is matched with a ) and every " is paired with another ".

  • If the console shows no result but a + sign, your code is incomplete and R is waiting for you to complete it.

  • In ggplot2, the + has to come at the end of a line, not at the start.

“Fill” Color

ggplot(data = penguins,
       mapping = aes(x = species)) +
  geom_bar(fill = "blue")
Figure 4

“Fill” Colors

ggplot(data = penguins,
       mapping = aes(x = species)) +
  geom_bar(fill = c("blue", "green", "yellow"))
Figure 5

“Fill” & “Color” Colors

ggplot(data = penguins,
       mapping = aes(x = species)) +
  geom_bar(fill = c("blue", "green", "yellow"),
           color = "black")
Figure 6

🧠 YOUR TURN

ggplot(data = penguins,
       mapping = aes(x = island)) +
  geom_bar(fill = c("red", "yellow", "darkgreen"),
           color = "black")

Know Your Data


glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Plot A Continuous Variable

# bill_length_mm is dbl type variable/column

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm)) +
  geom_histogram()
Figure 7

🧠 YOUR TURN

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm)) +
  geom_histogram(fill = "darkblue",
                 color = "white")

Two Continuous Variables

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()
Figure 8

Geom Size

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(size = 5)
Figure 9

Geom Shape

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(size = 5,
             shape = 8)
Figure 10

🧠 YOUR TURN

ggplot(data = penguins,
       mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(size = 2, shape = 23, color = "red", fill = "gold")

Plot A Factor & Factor

Sometimes, we want to differentiate values of a factor/category variable on the basis of another factor/category variable.

ggplot(data = penguins,
       mapping = aes(x = island)) +
  geom_bar(aes(fill = sex))
Figure 11

Plot A Factor & Continuous

Sometimes, we want to differentiate values from a continuous variable on the basis of factor/category variables.

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm)) +
  geom_histogram(aes(fill = sex), color = "black")
Figure 12

A Factor & Two Cont.

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = sex))
Figure 13

A Factor & Two Cont.

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species))
Figure 14

Write Labels

  • Title of the plot

  • Subtitle of the plot with more information

  • Title of the x-axis

  • Title of the y-axis

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species)) +
  labs(
    title = "The title of the plot",
    subtitle = "The subtitle of the plot",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  )
Figure 15

Different Shapes

Each level of the factor/category can be shown using a different shape and a different color.

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  labs(
    title = "The title of the plot",
    subtitle = "The subtitle of the plot",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  )
Figure 16

Various Themes

library(ggthemes) # provides theme_economist() and the other themes shown here

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  labs(
    title = "The title of the plot",
    subtitle = "The subtitle of the plot",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  ) +
  theme_economist()
Figure 17

Various Themes

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  labs(
    title = "The title of the plot",
    subtitle = "The subtitle of the plot",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  ) +
  theme_solarized_2()
Figure 18

Various Themes

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  labs(
    title = "The title of the plot",
    subtitle = "The subtitle of the plot",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  ) +
  theme_tufte()
Figure 19

Various Themes

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  labs(
    title = "The title of the plot",
    subtitle = "The subtitle of the plot",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  ) +
  theme_clean()
Figure 20

Color Palette

Color Palette

The R package ggthemes has a function, scale_color_colorblind(), that applies a colorblind-friendly color scheme.

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  labs(
    title = "The title of the plot",
    subtitle = "The subtitle of the plot",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  ) +
  theme_clean() +
  scale_color_colorblind()
Figure 21

Color Palette

library(RColorBrewer)
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  labs(
    title = "The title of the plot",
    subtitle = "The subtitle of the plot",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  ) +
  theme_clean() +
  scale_color_brewer(palette = "Dark2")
Figure 22

Color Palette

library(wesanderson)

names(wes_palettes)
 [1] "BottleRocket1"     "BottleRocket2"     "Rushmore1"        
 [4] "Rushmore"          "Royal1"            "Royal2"           
 [7] "Zissou1"           "Zissou1Continuous" "Darjeeling1"      
[10] "Darjeeling2"       "Chevalier1"        "FantasticFox1"    
[13] "Moonrise1"         "Moonrise2"         "Moonrise3"        
[16] "Cavalcanti1"       "GrandBudapest1"    "GrandBudapest2"   
[19] "IsleofDogs1"       "IsleofDogs2"       "FrenchDispatch"   
[22] "AsteroidCity1"     "AsteroidCity2"     "AsteroidCity3"    
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  labs(
    title = "The title of the plot",
    subtitle = "The subtitle of the plot",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  ) +
  theme_clean() +
  scale_color_manual(values = wes_palette("BottleRocket2", n = 3))
Figure 23

Export Plot

Export/save plot as pdf, jpg or png file.

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  labs(
    title = "The title of the plot",
    subtitle = "The subtitle of the plot",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  ) +
  theme_clean() +
  scale_color_manual(values = wes_palette("BottleRocket2", n = 3))

ggsave("penguins-plot.pdf")
Figure 24

🧑🏽‍💻👨🏽‍💻
Question & Answer



DATA
WRANGLING





Data Wrangling

Transforming Data

  • “narrowing in on observations of interest …

  • creating new variables that are functions of existing variables and …

  • calculating a set of summary statistics.”











R Package dplyr

dplyr Package

  • “dplyr is a grammar of data manipulation”

  • “providing a consistent set of verbs that help you solve the most common data manipulation challenges:”

dplyr Functions:

filter() Function:

Picks cases/observations based on their values.
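For comparison, the same idea of picking rows by their values can be sketched in base R, without dplyr. This is a swapped-in technique rather than filter() itself; it uses square-bracket indexing and subset() on a small example data frame. The names from_hr and older_from_hr and the age cut-off are illustrative assumptions.

```r
# Small example data frame (illustrative values)
example_df <- data.frame(
  name = c("Bhim", "Rama", "Sara", "Phule", "Savitri"),
  age = c(23, 25, 16, 40, 34),
  place = c("MH", "RJ", "DL", "HR", "HR")
)

# Keep only the rows where the condition is TRUE (note the comma before ])
from_hr <- example_df[example_df$place == "HR", ]
print(from_hr)

# subset() reads closer to filter(); & combines two conditions with AND
older_from_hr <- subset(example_df, place == "HR" & age > 35)
print(older_from_hr)
```

dplyr's filter() expresses the same row selection with a more consistent syntax, which is why the examples that follow use it.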

filter() Function

How do we get the data for only Gentoo penguins?

library(tidyverse)
library(palmerpenguins)

penguins |> 
  filter(species == "Gentoo")
# A tibble: 124 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo  Biscoe           46.1          13.2               211        4500
 2 Gentoo  Biscoe           50            16.3               230        5700
 3 Gentoo  Biscoe           48.7          14.1               210        4450
 4 Gentoo  Biscoe           50            15.2               218        5700
 5 Gentoo  Biscoe           47.6          14.5               215        5400
 6 Gentoo  Biscoe           46.5          13.5               210        4550
 7 Gentoo  Biscoe           45.4          14.6               211        4800
 8 Gentoo  Biscoe           46.7          15.3               219        5200
 9 Gentoo  Biscoe           43.3          13.4               209        4400
10 Gentoo  Biscoe           46.8          15.4               215        5150
# ℹ 114 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Wait! What the f* is |>

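  • This is called the native pipe operator.

  • |> lets you “pipe” an object forward to a function or call expression,

  • allowing you to express a sequence of operations that transform an object.

  • In RStudio, Ctrl + Shift + M inserts a pipe; it is |> when the native pipe option is enabled under Tools → Global Options → Code.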
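A minimal sketch of what the pipe changes, using only base R functions (the native pipe needs R 4.1 or later). The weights are example data and the names nested and piped are illustrative assumptions.

```r
weight_kg <- c(50, 52, 61, 40, 70)

# Without the pipe: nested calls are read from the inside out
nested <- round(mean(sqrt(weight_kg)), 2)

# With the pipe: the same steps are read from left to right
piped <- weight_kg |>
  sqrt() |>
  mean() |>
  round(2)

# Both versions give exactly the same result
print(identical(nested, piped))
```

The piped version puts each step on its own line, in the order in which it happens.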

filter() Function

How do we get the data for penguins with a bill length of more than 43 mm?

penguins |> 
  filter(bill_length_mm > 43)
# A tibble: 188 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           46            21.5               194        4200
 2 Adelie  Dream               44.1          19.7               196        4400
 3 Adelie  Torgersen           45.8          18.9               197        4150
 4 Adelie  Dream               43.2          18.5               192        4100
 5 Adelie  Biscoe              43.2          19                 197        4775
 6 Adelie  Biscoe              45.6          20.3               191        4600
 7 Adelie  Torgersen           44.1          18                 210        4000
 8 Adelie  Torgersen           43.1          19.2               197        3500
 9 Gentoo  Biscoe              46.1          13.2               211        4500
10 Gentoo  Biscoe              50            16.3               230        5700
# ℹ 178 more rows
# ℹ 2 more variables: sex <fct>, year <int>

🧠 YOUR TURN

  1. How do we get the data for only Adelie penguins?

  2. How do we get the data for penguins with a bill depth of more than 10 mm?

# question number 1
penguins |> 
  filter(species == "Adelie")
# A tibble: 152 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 142 more rows
# ℹ 2 more variables: sex <fct>, year <int>

# question number 2
penguins |> 
  filter(bill_depth_mm > 10)
# A tibble: 342 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           36.7          19.3               193        3450
 5 Adelie  Torgersen           39.3          20.6               190        3650
 6 Adelie  Torgersen           38.9          17.8               181        3625
 7 Adelie  Torgersen           39.2          19.6               195        4675
 8 Adelie  Torgersen           34.1          18.1               193        3475
 9 Adelie  Torgersen           42            20.2               190        4250
10 Adelie  Torgersen           37.8          17.1               186        3300
# ℹ 332 more rows
# ℹ 2 more variables: sex <fct>, year <int>

filter() Function

How do we get the data for Gentoo penguins with a bill length of more than 50 mm?

penguins |> 
  filter(species == "Gentoo",
         bill_length_mm > 50)
# A tibble: 22 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo  Biscoe           50.2          14.3               218        5700
 2 Gentoo  Biscoe           59.6          17                 230        6050
 3 Gentoo  Biscoe           50.5          15.9               222        5550
 4 Gentoo  Biscoe           50.5          15.9               225        5400
 5 Gentoo  Biscoe           50.1          15                 225        5000
 6 Gentoo  Biscoe           50.4          15.3               224        5550
 7 Gentoo  Biscoe           54.3          15.7               231        5650
 8 Gentoo  Biscoe           50.7          15                 223        5550
 9 Gentoo  Biscoe           51.1          16.3               220        6000
10 Gentoo  Biscoe           52.5          15.6               221        5450
# ℹ 12 more rows
# ℹ 2 more variables: sex <fct>, year <int>

filter() Function

How do we get the data for non-Gentoo penguins with a bill length of more than 50 mm and a weight of more than 4 kg?

penguins |> 
  filter(species != "Gentoo",
         bill_length_mm > 50,
         body_mass_g > 4000)
# A tibble: 11 × 8
   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
 1 Chinstrap Dream            52            18.1               201        4050
 2 Chinstrap Dream            50.5          19.6               201        4050
 3 Chinstrap Dream            52            19                 197        4150
 4 Chinstrap Dream            52.8          20                 205        4550
 5 Chinstrap Dream            54.2          20.8               201        4300
 6 Chinstrap Dream            51            18.8               203        4100
 7 Chinstrap Dream            52            20.7               210        4800
 8 Chinstrap Dream            53.5          19.9               205        4500
 9 Chinstrap Dream            50.8          18.5               201        4450
10 Chinstrap Dream            50.7          19.7               203        4050
11 Chinstrap Dream            50.8          19                 210        4100
# ℹ 2 more variables: sex <fct>, year <int>
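
The comma-separated conditions above are combined with AND. A minimal sketch of that idea, assuming a small made-up data frame called toy (hypothetical values, not the real penguins data):

```r
library(dplyr)

# Hypothetical toy data standing in for penguins (values are made up)
toy <- data.frame(
  species = c("Adelie", "Gentoo", "Gentoo", "Chinstrap"),
  bill_length_mm = c(39.1, 50.2, 46.5, 52.0),
  body_mass_g = c(3750, 5700, 4800, 4050)
)

# Comma-separated conditions: a row is kept only if ALL of them are TRUE
with_comma <- toy |>
  filter(species != "Gentoo", bill_length_mm > 50)

# The same filter written with the & (AND) operator
with_and <- toy |>
  filter(species != "Gentoo" & bill_length_mm > 50)

# The | (OR) operator keeps a row if AT LEAST ONE condition is TRUE
with_or <- toy |>
  filter(species == "Gentoo" | bill_length_mm > 50)
```

Here with_comma and with_and keep the same single Chinstrap row, while with_or keeps three rows.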

🧠 YOUR TURN

How do we get the data for penguins from Dream island only, with a bill depth of more than 7 mm and a weight of more than 3 kg?

penguins |> 
  filter(island == "Dream",
         bill_depth_mm > 7,
         body_mass_g > 3000)
# A tibble: 118 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Adelie  Dream            39.5          16.7               178        3250
 2 Adelie  Dream            37.2          18.1               178        3900
 3 Adelie  Dream            39.5          17.8               188        3300
 4 Adelie  Dream            40.9          18.9               184        3900
 5 Adelie  Dream            36.4          17                 195        3325
 6 Adelie  Dream            39.2          21.1               196        4150
 7 Adelie  Dream            38.8          20                 190        3950
 8 Adelie  Dream            42.2          18.5               180        3550
 9 Adelie  Dream            37.6          19.3               181        3300
10 Adelie  Dream            39.8          19.1               184        4650
# ℹ 108 more rows
# ℹ 2 more variables: sex <fct>, year <int>

select() Function:

Picks variables/columns based on their names.

select() Function

How do we keep only the species variable in the data?

penguins |> 
  select(species)
# A tibble: 344 × 1
   species
   <fct>  
 1 Adelie 
 2 Adelie 
 3 Adelie 
 4 Adelie 
 5 Adelie 
 6 Adelie 
 7 Adelie 
 8 Adelie 
 9 Adelie 
10 Adelie 
# ℹ 334 more rows

select() Function

How do we keep only the bill-related variables in the data?

penguins |> 
  select(bill_length_mm, bill_depth_mm)
# A tibble: 344 × 2
   bill_length_mm bill_depth_mm
            <dbl>         <dbl>
 1           39.1          18.7
 2           39.5          17.4
 3           40.3          18  
 4           NA            NA  
 5           36.7          19.3
 6           39.3          20.6
 7           38.9          17.8
 8           39.2          19.6
 9           34.1          18.1
10           42            20.2
# ℹ 334 more rows
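
When several variable names share a prefix, typing each one is not the only option. A small sketch, assuming a made-up data frame called toy (hypothetical values), that uses the starts_with() selection helper available in dplyr:

```r
library(dplyr)

# Hypothetical toy data with the same column names as penguins (values are made up)
toy <- data.frame(
  species = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, 50.2),
  bill_depth_mm = c(18.7, 14.3),
  body_mass_g = c(3750, 5700)
)

# Keep every column whose name begins with "bill"
bill_vars <- toy |>
  select(starts_with("bill"))

names(bill_vars)
```

This should keep only bill_length_mm and bill_depth_mm, in their original order.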

🧠 YOUR TURN

How do we get the data for the variables sex, year, island and flipper length?

penguins |> 
  select(sex, year, island, flipper_length_mm)
# A tibble: 344 × 4
   sex     year island    flipper_length_mm
   <fct>  <int> <fct>                 <int>
 1 male    2007 Torgersen               181
 2 female  2007 Torgersen               186
 3 female  2007 Torgersen               195
 4 <NA>    2007 Torgersen                NA
 5 female  2007 Torgersen               193
 6 male    2007 Torgersen               190
 7 female  2007 Torgersen               181
 8 male    2007 Torgersen               195
 9 <NA>    2007 Torgersen               193
10 <NA>    2007 Torgersen               190
# ℹ 334 more rows

💡 Tips for variable selection

  1. Use the names() function to see the exact names and the order of the variables.

  2. Use the : operator to select a range of variables.

penguins |> 
  select(island : flipper_length_mm)

  3. Use the position (column number) of the variables.

penguins |> 
  select(3 : 7)

  4. Use the - operator to exclude a range of variables.

penguins |> 
  select(-c(island : flipper_length_mm))
# A tibble: 344 × 4
   species body_mass_g sex     year
   <fct>         <int> <fct>  <int>
 1 Adelie         3750 male    2007
 2 Adelie         3800 female  2007
 3 Adelie         3250 female  2007
 4 Adelie           NA <NA>    2007
 5 Adelie         3450 female  2007
 6 Adelie         3650 male    2007
 7 Adelie         3625 female  2007
 8 Adelie         4675 male    2007
 9 Adelie         3475 <NA>    2007
10 Adelie         4250 <NA>    2007
# ℹ 334 more rows
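
The tips above can be tried side by side. A rough sketch, assuming a one-row made-up data frame called toy (hypothetical values) with the first six penguins column names:

```r
library(dplyr)

# Hypothetical toy data: one made-up row, same first six column names as penguins
toy <- data.frame(
  species = "Adelie",
  island = "Dream",
  bill_length_mm = 39.5,
  bill_depth_mm = 16.7,
  flipper_length_mm = 178,
  body_mass_g = 3250
)

# Tip 1: check the exact names and their order
names(toy)

# Tip 2: a range of variables by name
by_name <- toy |> select(island:flipper_length_mm)

# Tip 3: the same range by position (island is column 2, flipper_length_mm is column 5)
by_position <- toy |> select(2:5)

# Tip 4: everything except that range
dropped <- toy |> select(-c(island:flipper_length_mm))
```

In this toy data, by_name and by_position hold the same four columns, and dropped keeps species and body_mass_g.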

🧠 YOUR TURN

How do we get the variables in positions one to five, but without the variable island?

penguins |> 
  select(c(1, 3:5))
# A tibble: 344 × 4
   species bill_length_mm bill_depth_mm flipper_length_mm
   <fct>            <dbl>         <dbl>             <int>
 1 Adelie            39.1          18.7               181
 2 Adelie            39.5          17.4               186
 3 Adelie            40.3          18                 195
 4 Adelie            NA            NA                  NA
 5 Adelie            36.7          19.3               193
 6 Adelie            39.3          20.6               190
 7 Adelie            38.9          17.8               181
 8 Adelie            39.2          19.6               195
 9 Adelie            34.1          18.1               193
10 Adelie            42            20.2               190
# ℹ 334 more rows

mutate() Function:

Adds new variables that are functions of existing variables.

mutate() Function

How do we convert the body mass of penguins from grams to kilograms?

penguins |> 
  select(body_mass_g) |> 
  mutate(body_mass_kg = body_mass_g / 1000)
# A tibble: 344 × 2
   body_mass_g body_mass_kg
         <int>        <dbl>
 1        3750         3.75
 2        3800         3.8 
 3        3250         3.25
 4          NA        NA   
 5        3450         3.45
 6        3650         3.65
 7        3625         3.62
 8        4675         4.68
 9        3475         3.48
10        4250         4.25
# ℹ 334 more rows

mutate() Function

How do we measure a penguin’s bill size using its length and depth?

penguins |> 
  mutate(bill_size = bill_length_mm * bill_depth_mm) |> 
  select(bill_size)
# A tibble: 344 × 1
   bill_size
       <dbl>
 1      731.
 2      687.
 3      725.
 4       NA 
 5      708.
 6      810.
 7      692.
 8      768.
 9      617.
10      848.
# ℹ 334 more rows

🧠 YOUR TURN

How do we convert the bill dimensions from mm to cm?

penguins |> 
  select(bill_length_mm, bill_depth_mm) |> 
  mutate(bill_length_cm = bill_length_mm / 10,
         bill_depth_cm = bill_depth_mm / 10)
# A tibble: 344 × 4
   bill_length_mm bill_depth_mm bill_length_cm bill_depth_cm
            <dbl>         <dbl>          <dbl>         <dbl>
 1           39.1          18.7           3.91          1.87
 2           39.5          17.4           3.95          1.74
 3           40.3          18             4.03          1.8 
 4           NA            NA            NA            NA   
 5           36.7          19.3           3.67          1.93
 6           39.3          20.6           3.93          2.06
 7           38.9          17.8           3.89          1.78
 8           39.2          19.6           3.92          1.96
 9           34.1          18.1           3.41          1.81
10           42            20.2           4.2           2.02
# ℹ 334 more rows
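
Giving every new variable a name inside mutate() keeps the column names readable. A minimal sketch, assuming a made-up data frame called toy (hypothetical values); the column name bill_size_cm2 is also just an illustration:

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  bill_length_mm = c(39.1, 50.2),
  bill_depth_mm = c(18.7, 14.3)
)

converted <- toy |>
  mutate(
    # name = expression: each new column gets a clear name
    bill_length_cm = bill_length_mm / 10,
    bill_depth_cm = bill_depth_mm / 10,
    # a column created earlier in the same mutate() call can be used right away
    bill_size_cm2 = bill_length_cm * bill_depth_cm
  )

names(converted)
```

An expression without a name still works, but the new column is then named after the expression itself.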

arrange() Function:

Changes the ordering of the rows.

arrange() Function

How do we arrange the data by the bill length of the penguins?

penguins |> 
  arrange(bill_length_mm) # default is ascending order
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Dream               32.1          15.5               188        3050
 2 Adelie  Dream               33.1          16.1               178        2900
 3 Adelie  Torgersen           33.5          19                 190        3600
 4 Adelie  Dream               34            17.1               185        3400
 5 Adelie  Torgersen           34.1          18.1               193        3475
 6 Adelie  Torgersen           34.4          18.4               184        3325
 7 Adelie  Biscoe              34.5          18.1               187        2900
 8 Adelie  Torgersen           34.6          21.1               198        4400
 9 Adelie  Torgersen           34.6          17.2               189        3200
10 Adelie  Biscoe              35            17.9               190        3450
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

arrange() Function

How do we see the five penguins with the shortest bills?

penguins |> 
  arrange(bill_length_mm) |> 
  head(5) 

# use the tail() function to see the bottom of the data
# A tibble: 5 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Dream               32.1          15.5               188        3050
2 Adelie  Dream               33.1          16.1               178        2900
3 Adelie  Torgersen           33.5          19                 190        3600
4 Adelie  Dream               34            17.1               185        3400
5 Adelie  Torgersen           34.1          18.1               193        3475
# ℹ 2 more variables: sex <fct>, year <int>

🧠 YOUR TURN

How do we see the five penguins with the longest bills?

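penguins |> 
  arrange(bill_length_mm) |> 
  tail(5)

# arrange() always places missing values (NA) last, so the bottom two rows here are NA
# arrange(desc(bill_length_mm)) |> head(5) would list the five longest bills without them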
# A tibble: 5 × 8
  species   island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>     <fct>              <dbl>         <dbl>             <int>       <int>
1 Gentoo    Biscoe              55.9          17                 228        5600
2 Chinstrap Dream               58            17.8               181        3700
3 Gentoo    Biscoe              59.6          17                 230        6050
4 Adelie    Torgersen           NA            NA                  NA          NA
5 Gentoo    Biscoe              NA            NA                  NA          NA
# ℹ 2 more variables: sex <fct>, year <int>
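
Missing values deserve care when sorting. A short sketch, assuming a made-up data frame called toy (hypothetical id labels and values), showing that arrange() places NA last in both ascending and descending order:

```r
library(dplyr)

# Hypothetical toy data with one missing bill length (values are made up)
toy <- data.frame(
  id = c("a", "b", "c", "d"),
  bill_length_mm = c(46.5, NA, 59.6, 39.1)
)

# Ascending order: the NA row comes last
ascending <- toy |> arrange(bill_length_mm)

# Descending order with desc(): the NA row still comes last
descending <- toy |> arrange(desc(bill_length_mm))

# The two longest bills, with no NA row in the way
longest_two <- descending |> head(2)
```

In this toy data, longest_two holds rows c and a, while tail(2) on ascending would return row c together with the NA row b.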

summarise() Function:

Reduces multiple values down to a single summary.

summarise() Function

What is the mean bill length of penguins?

penguins |> 
  summarise(mean(bill_length_mm))
# A tibble: 1 × 1
  `mean(bill_length_mm)`
                   <dbl>
1                     NA

summarise() Function

What is the mean bill length of penguins after removing the missing values?

penguins |>
  drop_na() |> 
  summarise(mean(bill_length_mm))
# A tibble: 1 × 1
  `mean(bill_length_mm)`
                   <dbl>
1                   44.0
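
There is more than one way to handle missing values, and the choices can give different means. A rough sketch, assuming a made-up data frame called toy (hypothetical values) and that tidyr is installed for drop_na():

```r
library(dplyr)
library(tidyr)  # drop_na() comes from tidyr

# Hypothetical toy data: one missing bill length and one missing sex (values are made up)
toy <- data.frame(
  bill_length_mm = c(40, 50, NA, 60),
  sex = c(NA, "male", "female", "female")
)

# mean() returns NA when any value is missing
all_rows <- toy |>
  summarise(mean_bill = mean(bill_length_mm))

# na.rm = TRUE ignores the missing bill length only: mean of 40, 50, 60
na_removed <- toy |>
  summarise(mean_bill = mean(bill_length_mm, na.rm = TRUE))

# drop_na() with no arguments drops rows with NA in ANY column: mean of 50, 60
complete_rows <- toy |>
  drop_na() |>
  summarise(mean_bill = mean(bill_length_mm))

# drop_na(bill_length_mm) only looks at that one column: mean of 40, 50, 60
one_column <- toy |>
  drop_na(bill_length_mm) |>
  summarise(mean_bill = mean(bill_length_mm))
```

In this toy data the result is NA, 50, 55 and 50 respectively, so it is worth deciding which rows should count before summarising.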

summarise() Function

What is the species-wise mean bill length of penguins?

penguins |>
  drop_na() |> 
  group_by(species) |> 
  summarise(mean(bill_length_mm))
# A tibble: 3 × 2
  species   `mean(bill_length_mm)`
  <fct>                      <dbl>
1 Adelie                      38.8
2 Chinstrap                   48.8
3 Gentoo                      47.6

summarise() Function

What is the species-wise mean bill length of penguins and the total number of penguins in each species?

penguins |>
  drop_na() |> 
  group_by(species) |> 
  summarise(mean(bill_length_mm),
            n = n())

# the n() function gives the number of observations in the current group
# A tibble: 3 × 3
  species   `mean(bill_length_mm)`     n
  <fct>                      <dbl> <int>
1 Adelie                      38.8   146
2 Chinstrap                   48.8    68
3 Gentoo                      47.6   119
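
Summary columns can be named in the same way as n = n() above, which avoids column names in backticks. A small sketch, assuming a made-up data frame called toy (hypothetical values); the name mean_bill is only an illustration:

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo"),
  bill_length_mm = c(38, 40, 48)
)

by_species <- toy |>
  group_by(species) |>
  summarise(
    mean_bill = mean(bill_length_mm),  # named summary column
    n = n()                            # number of rows in each group
  )

by_species
```

In this toy data, Adelie has a mean_bill of 39 with n = 2, and Gentoo has a mean_bill of 48 with n = 1.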

🧠 YOUR TURN

Which penguins weigh more, male or female?

penguins |>
  drop_na() |> 
  group_by(sex) |> 
  summarise(mean(body_mass_g),
            n = n())
# A tibble: 2 × 3
  sex    `mean(body_mass_g)`     n
  <fct>                <dbl> <int>
1 female               3862.   165
2 male                 4546.   168

References

Thank
You