FOR
BEGINNERS

Dr. Ajay Kumar Koli, PhD $\cdot$ SARA Institute of Data Science $\cdot$ India

Hello Everyone!!

Dr. Ajay Kumar Koli

Founder & Executive Director

SARA Institute of Data Science

@sara_institute

Slide source code

Star This Course on GitHub

About SARA

SARA stands for Savitribai Ramabai Institute of Data Science.
It is a charitable education trust, established in 2023 by Dr. Ajay Kumar Koli & Dr. Kiran Lata Koli.
Our mission is to enhance the representation of underprivileged communities in the field of data science.

Purpose

To share with you the reasons (and hopefully convince you) why you should learn data science tools such as R, Python, GitHub, and Quarto.

Work

We think we work like this but …

Work

This is what our work actually looks like.

🤯 Work Flowchart

MS Word

Excel

PPTs

Other Tools:

PDF
SPSS
SAS
STATA
Reference management

Work Influencer

👀 Focus on Content.

💅 Set Yourself Apart.

Data Science

“Data science is an exciting discipline that allows you to transform raw data into understanding, insight, and knowledge.”

Career in Data Science

“You don’t need to be an expert programmer to be a successful data scientist.”

Types of Data Roles

Data Science Process

R and
RStudio

R Programming Language

“R is a free software environment for statistical computing and graphics.”

History of R

R grew out of the S language, which was developed at Bell Labs.
R first appeared in August 1993, created by:

Ross Ihaka
(New Zealand Statistician)

Robert Gentleman
(Canadian Statistician)

R is FREE

Download R from CRAN

R Console

R version 4.3.1 (2023-06-16)
R nickname “Beagle Scouts”
R warranty notice “ABSOLUTELY NO WARRANTY”
R prompt >

Workspace Image

Don’t save workspace image.
It helps in “freshly minted R sessions”.
“put more trust in your script than in your memory”.

OPERATORS

Operators

“Operators are used to perform operations on variables and values.”

In the code `12 + 3`, `+` is an operator.

Tip

Put spaces between and around operators (=+-*/)

R Arithmetic Operators

Arithmetic operators are used with numeric values to perform common mathematical operations:

Operator	Name	Example
`+`	Addition	x + y
`-`	Subtraction	x - y
`*`	Multiplication	x * y
`/`	Division	x / y
`^`	Exponent	x ^ y

R Console

Code

Output

[1] 7

R Console: Addition

Code

2 + 1

Output

[1] 3

R Console: Subtraction

Code

10 - 2

Output

[1] 8

R Console: Multiplication

Code

12 * 4

Output

[1] 48

R Console: Division

Code

25 / 5

Output

[1] 5

R Comparison Operators

Comparison operators are used to compare two values:

Operator	Name	Example
`==`	Equal	x == y
`!=`	Not equal	x != y
`>`	Greater than	x > y
`<`	Less than	x < y
`>=`	Greater than or equal to	x >= y
`<=`	Less than or equal to	x <= y

R Console: Logic

Code

4 == 5

Output

[1] FALSE

R Console: Logic

Code

67 > 60

Output

[1] TRUE

R Console

Code

3434 + 343453 * 2323 / 534 - 1000

Output

[1] 1496519

Important

R follows the BODMAS rule (brackets, orders, division, multiplication, addition, and subtraction) when it evaluates arithmetic expressions.
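To make the rule concrete, here is a small sketch with made-up numbers (they are not from the slides). It shows how brackets change the result and that the exponent `^` is evaluated before multiplication.

```r
# Multiplication is evaluated before addition
without_brackets <- 2 + 3 * 4      # 3 * 4 first, then + 2

# Brackets are evaluated first
with_brackets <- (2 + 3) * 4       # 2 + 3 first, then * 4

# Orders (exponents) are evaluated before multiplication
power_first <- 2 * 3 ^ 2           # 3 ^ 2 first, then * 2

print(without_brackets)  # 14
print(with_brackets)     # 20
print(power_first)       # 18
```

When in doubt, add brackets: they cost nothing and make the intended order obvious to the reader.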

R Console

Code

12:18

Output

[1] 12 13 14 15 16 17 18

Important

R Miscellaneous Operators: Miscellaneous operators are used to manipulate data.

Operator	Description	Example
`:`	Creates a series of numbers in a sequence	1:10
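A few more uses of `:`, as a sketch with illustrative values. Note that `:` is evaluated before arithmetic operators such as `*`, so `1:3 * 2` doubles the whole sequence.

```r
ascending <- 1:10        # 1, 2, ..., 10
descending <- 5:1        # counts down: 5, 4, 3, 2, 1
doubled <- 1:3 * 2       # same as (1:3) * 2

print(sum(ascending))    # 55
print(descending)        # 5 4 3 2 1
print(doubled)           # 2 4 6
```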

Plot Using R

Code
Output

plot(1:100)

😏 That’s Okay But How To

combine plot, text, tables and images in a single file.
publish my work online or convert into a word, pdf or html file.
work efficiently with my different projects and save, share and track them.

🔥 WE NEED A SUPERHERO 🔥

S T U D I O

Posit, formerly RStudio

RStudio is an integrated development environment (IDE) for R and Python.
According to Posit, RStudio is “the most trusted IDE for open source data science”.
Download RStudio.

RStudio IDE

“It includes a console, syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging, and workspace management.”

RStudio $\rightarrow$ Tools $\rightarrow$ Global Options

R & RStudio

Imagine R as a powerful engine

and RStudio as a stylish car

PROJECT

Open RStudio

RStudio Without Project

RStudio Project Helps:

“to divide your work into multiple contexts, each with their own
- working directory,
- workspace,
- history, and
- source documents.”

⛷️ Create
RStudio Project
in 4 Steps ⛷️

Create RStudio Project

In Case Anything Goes Wrong…

Create RStudio Project

RStudio Project “name”

RStudio Project “path”

RStudio Project

Artwork by Allison Horst

Write R Codes in

R Console

“The code input and output are in the R console”

Write R Codes in

R script using RStudio.

Write R Codes in

Quarto document using RStudio

R Script (.R)

Write code in the R script $\rightarrow$ the console will show the results.

Benefits of writing code in an R script:
- You can save it for later use and revision.
- You can share it with others.
- You can keep better track of your code.

💡 Tips for R Script

Write readable code, because other people might need to use it.
Write readable code, because you might need to use it yourself weeks, months, or years after you wrote it.
Put spaces around variable names and operators (=+-*/).
Break up long lines of code.
Keep a consistent style.

FUNCTION

Artwork by Allison Horst

R Function

“A function, in a programming environment, is a set of instructions.”
“A programmer builds a function to avoid repeating the same task, or reduce complexity.”
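As an illustration of building your own function, here is a minimal sketch. The function name `kg_to_g` is made up for this example; it is not part of R or of any package.

```r
# A hypothetical helper that converts kilograms to grams
kg_to_g <- function(weight_kg) {
  weight_kg * 1000   # the last evaluated value is returned
}

# Write the instructions once, then reuse them for any input
single_value <- kg_to_g(50)
many_values <- kg_to_g(c(50, 52, 61))

print(single_value)
print(many_values)
```

Because the conversion lives in one place, a change to the rule only has to be made once.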

R Function

Code
Output

round(24.3454, 3)

round(argument 1, argument 2)

[1] 24.345

Structure of R Function

Round Function

Code
Output

Function with default argument.

round(46.487)

[1] 46

Round Function

Code
Output

Function with a specific value of an argument.

round(x = 46.587, digits = 2)

[1] 46.59

Square Root Function

Code
Output

Function with a specific value of an argument.

sqrt(x = 9)

[1] 3

Sequence Function

Code
Output

Function with a specific value of an argument.

seq.int(from = 10, to = 30, by = 5)

seq.int(from = 10,
        to = 30,
        by = 5)

[1] 10 15 20 25 30
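The examples above name their arguments (`from =`, `to =`, `by =`). As a short sketch of why that helps, the code below uses `seq()`, the general-purpose relative of `seq.int()`: named arguments can be written in any order, and a different argument (`length.out`) can replace the step size.

```r
# Step size given explicitly
by_step <- seq(from = 10, to = 30, by = 5)

# Named arguments can be written in any order
reordered <- seq(by = 5, to = 30, from = 10)

# Instead of a step size, ask for a fixed number of values
by_length <- seq(from = 10, to = 30, length.out = 5)

print(by_step)
print(reordered)
print(by_length)
```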

COMMENT

Artwork by Allison Horst

Comment:

“Humans will be able to read the comments, but your computer will pass over them.”
In R, # is used as a commenting symbol.
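A short sketch of the three usual places a comment appears. The variable names and values are illustrative only.

```r
# This whole line is a comment; R ignores it
radius <- 2               # a comment can also follow code on the same line
area <- pi * radius ^ 2   # area of a circle

# area <- 0               # commented-out code is not run

print(area)
```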

PACKAGES

R Packages

“An R package is a collection of functions, data, and documentation that extends the capabilities of base R. Using packages is key to the successful use of R.”

R Packages

Install Packages

Name of the Packages

Installed Packages

Function to Install Packages

install.packages("tidyverse")

Function to Call Package

library(tidyverse)

Using Packages

You need to install a package only once, just as:
- 📚 We buy a book once and read it again and again.
- 💡 We fit a bulb once and use it again and again.

Using Packages

In every R document, you need to load the package once using the `library()` function, for example `library(ggplot2)`.
Once in a while, you need to update the installed packages as well.
If you uninstall R, you may lose your installed packages; uninstalling RStudio alone does not remove them.
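A common pattern, sketched here under the assumption that you only want to install a package when it is missing: `requireNamespace()` reports whether a package is installed without loading it. The example checks the built-in `stats` package so that it runs anywhere; the `install.packages()` call is left as a comment because it needs an internet connection.

```r
# TRUE if the package is installed, FALSE otherwise (it is not attached)
stats_installed <- requireNamespace("stats", quietly = TRUE)
print(stats_installed)

# The same check for a package from CRAN, installing it only when missing:
# if (!requireNamespace("tidyverse", quietly = TRUE)) {
#   install.packages("tidyverse")
# }

# Loading still happens once per session or document
library(stats)
```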

Tools $\rightarrow$ Package Updates

Select Packages to Update

Click Install Updates

To Remove Packages

OBJECTS

Artwork by Allison Horst

R Object

Just a name that you can use to call up stored data.

Important

R assignment operators: Assignment operators are used to assign values to variables.

my_var <- 3

my_var # print my_var

Create Object

Code
Output

age <- c(23, 25, 16, 40, 34)

age

[1] 23 25 16 40 34

Create Object

Code
Output

income <- c(23000, 25000, 16000, 4000, 34000)

income

[1] 23000 25000 16000  4000 34000

Create Object

Code
Output

name <- c("Bhim", "Rama", "Sara", "Phule", "Savitri")

name

[1] "Bhim"    "Rama"    "Sara"    "Phule"   "Savitri"

Create Object

Code
Output

place <- c("MH", "RJ", "DL", "HR", "HR")

place

[1] "MH" "RJ" "DL" "HR" "HR"

Create Object

Code
Output

weight_kg <- c(50, 52, 61, 40, 70)

weight_kg

[1] 50 52 61 40 70

💡Guidelines to Name R Objects:

A name cannot start with a number.
A name cannot contain special symbols such as ^, !, $, @, +, -, /, *, or :.
Avoid capital letters.
Avoid spaces.
Use an underscore (like weight_kg) to separate words; a dash (like weight-kg) is read by R as subtraction.
For file names (not object names), add the date first if chronology matters (2020-09-05-file-name).
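To check whether a name is valid, one option is base R's `make.names()`, which turns any text into a syntactically valid name. A sketch with a few illustrative candidates: if the result differs from what you typed, the original was not a valid object name.

```r
candidates <- c("weight_kg", "weight-kg", "2nd_weight", "weight kg")

# make.names() replaces invalid characters with "." and
# prefixes "X" when a name starts with a number
fixed <- make.names(candidates)
print(fixed)

# A name is valid when make.names() leaves it unchanged
is_valid <- candidates == fixed
print(is_valid)
```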

RStudio Environment Window

🤔 How to combine all these objects and form a data set?

👇 Something Like This 😻😻

     name income age place weight_kg
1    Bhim  23000  23    MH        50
2    Rama  25000  25    RJ        52
3    Sara  16000  16    DL        61
4   Phule   4000  40    HR        40
5 Savitri  34000  34    HR        70

How to Create a Data Object?

Code
Output

example_df <- data.frame(name, income, age, place, weight_kg)

example_df

     name income age place weight_kg
1    Bhim  23000  23    MH        50
2    Rama  25000  25    RJ        52
3    Sara  16000  16    DL        61
4   Phule   4000  40    HR        40
5 Savitri  34000  34    HR        70
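Once the data frame exists, its parts can be inspected. A small sketch that rebuilds part of `example_df` so it runs on its own, then uses `$` to pull out one column and `nrow()` and `ncol()` to count rows and columns.

```r
name <- c("Bhim", "Rama", "Sara", "Phule", "Savitri")
income <- c(23000, 25000, 16000, 4000, 34000)
age <- c(23, 25, 16, 40, 34)

example_df <- data.frame(name, income, age)

row_count <- nrow(example_df)                # number of observations
column_count <- ncol(example_df)             # number of variables
average_income <- mean(example_df$income)    # $ returns one column as a vector

print(row_count)
print(column_count)
print(average_income)
```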

Export Data as a `csv` File

Code
Output

library(readr)

# first create a folder named "data" in your working directory
write_csv(example_df, "data/example_df.csv")

To see the created file, check the “data” folder in your working directory.
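If readr is not installed, base R can do the same job. The sketch below is an alternative, not the method used on the slide: it creates the "data" folder from code with `dir.create()`, writes the file with `write.csv()`, and reads it back to confirm the export. It writes into a temporary directory (an assumption made so the example leaves no files behind); in a project you would use `"data"` directly.

```r
example_df <- data.frame(
  name = c("Bhim", "Rama", "Sara"),
  income = c(23000, 25000, 16000)
)

# Create the folder if it does not exist yet
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)

# row.names = FALSE avoids an extra column of row numbers
csv_path <- file.path(data_dir, "example_df.csv")
write.csv(example_df, csv_path, row.names = FALSE)

# Read the file back to check the export
round_trip <- read.csv(csv_path)
print(round_trip)
```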

List of All Objects

Code
Output

objects()

[1] "age"             "example_df"      "has_annotations" "income"
[5] "name"            "place"           "weight_kg"

COMMUNITY

Artwork by Allison Horst

Help Using Console `>`

In the console, type `?` followed by your query,

for example ?knitr.
for example ?mtcars.
for example ?dplyr.

RStudio: Package Website

Posit Community

Stack Overflow

GitHub

PUBLISH
USING
QUARTO

QUARTO

Quarto is the Next Generation of R Markdown

Quarto

“An open-source scientific and technical publishing system”

Quarto can produce a wide variety of output formats:

Articles & Reports

Presentations

Interactive Docs

Websites

Books

Download Quarto

Get Started: Choose IDE

Create a New Quarto Document

File $\rightarrow$ New File $\rightarrow$ Quarto Document

New Quarto Document

Save Quarto Document

Process When You Render the Quarto Document

Quarto sends the .qmd file to knitr, which executes all of the code chunks and creates a new markdown (.md) document which includes the code and its output. The markdown file generated by knitr is then processed by pandoc, which is responsible for creating the finished file.

Process When You Render the Quarto Document

Source Editor vs. Visual Editor

Visual Editor

Source Editor

MARKDOWN

Text formatting

Markdown Syntax	Output
`normal`	normal
`*italics*`	*italics*
`**bold**`	**bold**
`***bold italics***`	***bold italics***

Text formatting

Markdown Syntax	Output
`superscript^2^`	superscript²
`subscript~2~`	subscript₂
`~~strike through~~`	~~strike through~~
`` `verbatim code` ``	`verbatim code`

Headings

Markdown Syntax	Output
`# Header 1`	Header 1
`## Header 2`	Header 2
`### Header 3`	Header 3
`#### Header 4`	Header 4
`##### Header 5`	Header 5
`###### Header 6`	Header 6

Insert Links

Markdown syntax	Output
`<https://saraedu.netlify.app/>`	https://saraedu.netlify.app/

Insert Links

Markdown syntax	Output
`[SARA](https://saraedu.netlify.app/)`	SARA

Add Images

If the image is saved on your computer,
![](add image path here)

Markdown Syntax	Output
`![](rose.jpg)`

Add Images

If the image is taken from the internet,
![](add image link here)

![Smile everyday](https://images.unsplash.com/photo-1627130595904-ebeeb6540a93?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D)

Unordered List

Markdown Syntax

* Item 1
* Item 2
* Item 3

Output

• Item 1
• Item 2
• Item 3

Unordered List: Sub-items

Markdown Syntax

* Main items
    + Sub-item 1
    + Sub-item 2
        - Sub-sub-item 1

Output

• Main items
    ◦ Sub-item 1
    ◦ Sub-item 2
        ▪ Sub-sub-item 1

Ordered List

Markdown Syntax

1. Eggs
1. Tea
1. Fish
1. Milk

Output

1. Eggs
2. Tea
3. Fish
4. Milk

List

Markdown Syntax

(@)  A list whose numbering

continues after

(@)  an interruption

Output

1. A list whose numbering

continues after

2. an interruption

List

Markdown Syntax

::: {}
1. A list
:::

::: {}
1. Followed by another list
:::

Output

1. A list

1. Followed by another list

Definition

term
: definition

Markdown Syntax	Output
`Power : Power is power.`	Power Power is power.

Equations

Use $ delimiters for inline math.

Markdown Syntax	Output
`It is a great equation $E = mc^{2}$`	It is a great equation $E=mc^{2}$

Equations

Use $$ delimiters for display math.

Markdown Syntax	Output
`It is a great equation $$E = mc^{2}$$`	It is a great equation \[E=mc^{2}\]

In-line Coding

`r `

Code

`r 1+1`

Output

2

Videos

You can include videos in documents using the {{< video >}} shortcode.

Code

{{< video https://www.youtube.com/embed/wo9vZccmqwc >}}

Output

Tables

Markdown Syntax

| Right | Left | Default | Center |
|------:|:-----|---------|:------:|
|   12  |  12  |    12   |    12  |
|  123  |  123 |   123   |   123  |
|    1  |    1 |     1   |     1  |

Output

Right	Left	Default	Center
12	12	12	12
123	123	123	123
1	1	1	1

Know Your Data

“Happy families are all alike; every unhappy family is unhappy in its own way.” — Leo Tolstoy

Know Your Data

Code
Output

example_df

     name income age place weight_kg
1    Bhim  23000  23    MH        50
2    Rama  25000  25    RJ        52
3    Sara  16000  16    DL        61
4   Phule   4000  40    HR        40
5 Savitri  34000  34    HR        70

Data: Variables

     name income age place weight_kg
1    Bhim  23000  23    MH        50
2    Rama  25000  25    RJ        52
3    Sara  16000  16    DL        61
4   Phule   4000  40    HR        40
5 Savitri  34000  34    HR        70

Data: Observations

     name income age place weight_kg
1    Bhim  23000  23    MH        50
2    Rama  25000  25    RJ        52
3    Sara  16000  16    DL        61
4   Phule   4000  40    HR        40
5 Savitri  34000  34    HR        70

Data: Values

     name income age place weight_kg
1    Bhim  23000  23    MH        50
2    Rama  25000  25    RJ        52
3    Sara  16000  16    DL        61
4   Phule   4000  40    HR        40
5 Savitri  34000  34    HR        70

Column/Variable/Data Types

chr = character, example "A"
int = integer, example 1L
dbl = double, example 1.5
lgl = logical, example TRUE
fct = factor, example factor("A")
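You can ask R which type a value has. A sketch using the example values from the list above: `class()` gives the high-level type, while `typeof()` shows how the value is stored internally (a factor, for instance, is stored as integer codes).

```r
character_class <- class("A")          # "character"
integer_class <- class(1L)             # "integer"
double_class <- class(1.5)             # "numeric"
double_storage <- typeof(1.5)          # "double"
logical_class <- class(TRUE)           # "logical"
factor_class <- class(factor("A"))     # "factor"
factor_storage <- typeof(factor("A"))  # "integer"

print(c(character_class, integer_class, double_class, logical_class, factor_class))
print(c(double_storage, factor_storage))
```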

Numeric Value Types

Data Summary

Code
Output

summary(example_df)

     name               income           age          place          
 Length:5           Min.   : 4000   Min.   :16.0   Length:5          
 Class :character   1st Qu.:16000   1st Qu.:23.0   Class :character  
 Mode  :character   Median :23000   Median :25.0   Mode  :character  
                    Mean   :20400   Mean   :27.6                     
                    3rd Qu.:25000   3rd Qu.:34.0                     
                    Max.   :34000   Max.   :40.0                     
   weight_kg   
 Min.   :40.0  
 1st Qu.:50.0  
 Median :52.0  
 Mean   :54.6  
 3rd Qu.:61.0  
 Max.   :70.0

Data Summary

Code
Output

str(example_df)

'data.frame':   5 obs. of  5 variables:
 $ name     : chr  "Bhim" "Rama" "Sara" "Phule" ...
 $ income   : num  23000 25000 16000 4000 34000
 $ age      : num  23 25 16 40 34
 $ place    : chr  "MH" "RJ" "DL" "HR" ...
 $ weight_kg: num  50 52 61 40 70

Data Summary

Code
Output

glimpse(example_df)

Rows: 5
Columns: 5
$ name      <chr> "Bhim", "Rama", "Sara", "Phule", "Savitri"
$ income    <dbl> 23000, 25000, 16000, 4000, 34000
$ age       <dbl> 23, 25, 16, 40, 34
$ place     <chr> "MH", "RJ", "DL", "HR", "HR"
$ weight_kg <dbl> 50, 52, 61, 40, 70

Data Summary

Code
Output

library(skimr)

skim(example_df)

Data summary
Name	example_df
Number of rows	5
Number of columns	5
_______________________
Column type frequency:
character	2
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
name	0	1	4	7	0	5	0
place	0	1	2	2	0	4	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
income	1	20400.0	11193.75	4000	16000	23000	25000	34000	▃▃▁▇▃
age	1	27.6	9.45	16	23	25	34	40	▃▇▁▃▃
weight_kg	1	54.6	11.39	40	50	52	61	70	▃▇▁▃▃

palmerpenguins

The R package is named palmerpenguins, and its dataset is named penguins.

Data Summary

Code
Output

library(palmerpenguins) # do not forget to load the package

glimpse(penguins)

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

🧠 YOUR TURN

How to summarize the penguins data using the summary(), skim(), & str() functions?

DATA
VISUALIZATION

“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey

Goal

How to visualize data using R package ggplot2.

ggplot2 Layers

Import Data

Task
Code
Output

library(tidyverse)
library(palmerpenguins)

ggplot(data = penguins)

Map Variables to Aesthetics

Task
Code
Output

ggplot(data = penguins,
       mapping = aes(x = species))

Add Geometric Shapes

Task
Code
Output

ggplot(data = penguins,
       mapping = aes(x = species)) +
  geom_bar()

Key Components are:

Data,
A set of aesthetic mappings between variables in the data and visual properties, and
At least one layer which describes how to render each observation. Layers are usually created with a geom function.

🧠 YOUR TURN

Task
Answer

ggplot(data = penguins,
       mapping = aes(x = island)) +
  geom_bar()

Common Mistakes

Make sure that every ( is matched with a ) and every " is paired with another ".
If the console shows no result but a + sign, your code is incomplete and R is waiting for you to complete it.
In ggplot2, + has to come at the end of the line, not the start.

“Fill” Color

Task
Code
Output

ggplot(data = penguins,
       mapping = aes(x = species)) +
  geom_bar(fill = "blue")

“Fill” Colors

Task
Code
Output

ggplot(data = penguins,
       mapping = aes(x = species)) +
  geom_bar(fill = c("blue", "green", "yellow"))

“Fill” & “Color” Colors

Task
Code
Output

ggplot(data = penguins,
       mapping = aes(x = species)) +
  geom_bar(fill = c("blue", "green", "yellow"),
           color = "black")

🧠 YOUR TURN

Task
Answer

ggplot(data = penguins,
       mapping = aes(x = island)) +
  geom_bar(fill = c("red", "yellow", "darkgreen"),
           color = "black")

Know Your Data

glimpse(penguins)

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Plot A Continuous Variable

Task
Code
Output

# bill_length_mm is dbl type variable/column

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm)) +
  geom_histogram()

🧠 YOUR TURN

Task
Answer

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm)) +
  geom_histogram(fill = "darkblue",
                 color = "white")

Two Continuous Variables

Task
Code
Output

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point()

Geom Size

Task
Code
Output

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(size = 5)

Geom Shape

Task
Code
Output

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(size = 5,
             shape = 8)

🧠 YOUR TURN

Task
Answer

ggplot(data = penguins,
       mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(size = 2, shape = 23, color = "red", fill = "gold")

Plot A Factor & Factor

Task
Code
Output

Sometimes, we want to differentiate values of a factor/category variable on the basis of another factor/category variable.

ggplot(data = penguins,
       mapping = aes(x = island)) +
  geom_bar(aes(fill = sex))

Plot A Factor & Continuous

Task
Code
Output

Sometimes, we want to differentiate the values of a continuous variable on the basis of a factor/category variable.

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm)) +
  geom_histogram(aes(fill = sex), color = "black")

A Factor & Two Cont.

Task
Code
Output

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = sex))

A Factor & Two Cont.

Task
Code
Output

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species))

Write Labels

Task
Code
Output

Title of the plot
Subtitle of the plot with more information
Title of the x-axis
Title of the y-axis

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species)) +
  labs(
    title = "The title of the plot",
    subtitle = "The subtitle of the plot",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  )

Different Shapes

Task
Code
Output

Each level of the factor/category can be shown using a different shape and a different color.

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  labs(
    title = "The title of the plot",
    subtitle = "The subtitle of the plot",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  )

Various Themes

Task
Code
Output

Source: ggthemes

library(ggthemes)

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  labs(
    title = "The title of the plot",
    subtitle = "The subtitle of the plot",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  ) +
  theme_economist()

Various Themes

Task
Code
Output

Source: ggthemes

library(ggthemes)

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  labs(
    title = "The title of the plot",
    subtitle = "The subtitle of the plot",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  ) +
  theme_solarized_2()

Various Themes

Task
Code
Output

Source: ggthemes

library(ggthemes)

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  labs(
    title = "The title of the plot",
    subtitle = "The subtitle of the plot",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  ) +
  theme_tufte()

Various Themes

Task
Code
Output

Source: ggthemes

library(ggthemes)

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  labs(
    title = "The title of the plot",
    subtitle = "The subtitle of the plot",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  ) +
  theme_clean()

Color Palette

Task
Code
Output

The R package ggthemes has a function, scale_color_colorblind(), that applies a colorblind-friendly color scheme.

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  labs(
    title = "The title of the plot",
    subtitle = "The subtitle of the plot",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  ) +
  theme_clean() +
  scale_color_colorblind()

Color Palette

Task
Code
Output

library(RColorBrewer)

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  labs(
    title = "The title of the plot",
    subtitle = "The subtitle of the plot",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  ) +
  theme_clean() +
  scale_color_brewer(palette = "Dark2")

Color Palette

Task
Code
Output

library(wesanderson)

names(wes_palettes)

 [1] "BottleRocket1"     "BottleRocket2"     "Rushmore1"        
 [4] "Rushmore"          "Royal1"            "Royal2"           
 [7] "Zissou1"           "Zissou1Continuous" "Darjeeling1"      
[10] "Darjeeling2"       "Chevalier1"        "FantasticFox1"    
[13] "Moonrise1"         "Moonrise2"         "Moonrise3"        
[16] "Cavalcanti1"       "GrandBudapest1"    "GrandBudapest2"   
[19] "IsleofDogs1"       "IsleofDogs2"       "FrenchDispatch"   
[22] "AsteroidCity1"     "AsteroidCity2"     "AsteroidCity3"

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  labs(
    title = "The title of the plot",
    subtitle = "The subtitle of the plot",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  ) +
  theme_clean() +
  scale_color_manual(values = wes_palette("BottleRocket2", n = 3))

Export Plot

Task
Code
Output

Export/save plot as pdf, jpg or png file.

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  labs(
    title = "The title of the plot",
    subtitle = "The subtitle of the plot",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  ) +
  theme_clean() +
  scale_color_manual(values = wes_palette("BottleRocket2", n = 3))

ggsave("penguins-plot.pdf")

🧑🏽‍💻👨🏽‍💻
Question & Answer

DATA
WRANGLING

Data Wrangling

“data exploration and data manipulation” by Jesse Mostipak
“tidying and transforming” by Hadley Wickham & Garrett Grolemund

Transforming Data

“narrowing in on observations of interest …
creating new variables that are functions of existing variables and …
calculating a set of summary statistics.”

R Package dplyr

`dplyr` Package

“dplyr is a grammar of data manipulation”
“providing a consistent set of verbs that help you solve the most common data manipulation challenges:”

`dplyr` Functions:

`filter()` Function:

Picks cases/observations based on their values.
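The idea behind `filter()` is that a comparison such as `place == "HR"` gives one TRUE or FALSE per row, and only the TRUE rows are kept. Here is a base-R sketch of that idea on a rebuilt copy of `example_df`, using `subset()` rather than dplyr so that it runs without any packages.

```r
example_df <- data.frame(
  name = c("Bhim", "Rama", "Sara", "Phule", "Savitri"),
  income = c(23000, 25000, 16000, 4000, 34000),
  place = c("MH", "RJ", "DL", "HR", "HR")
)

# One TRUE/FALSE per row
in_hr <- example_df$place == "HR"
print(in_hr)

# Keep only the rows where both conditions are TRUE
hr_high_income <- subset(example_df, place == "HR" & income > 10000)
print(hr_high_income)
```

Two rows are from "HR", but only one of them also has an income above 10000, so a single row remains.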

`filter()` Function

Task
Code
Output

How do we get the data for only Gentoo penguins?

library(tidyverse)
library(palmerpenguins)

penguins |> 
  filter(species == "Gentoo")

# A tibble: 124 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo  Biscoe           46.1          13.2               211        4500
 2 Gentoo  Biscoe           50            16.3               230        5700
 3 Gentoo  Biscoe           48.7          14.1               210        4450
 4 Gentoo  Biscoe           50            15.2               218        5700
 5 Gentoo  Biscoe           47.6          14.5               215        5400
 6 Gentoo  Biscoe           46.5          13.5               210        4550
 7 Gentoo  Biscoe           45.4          14.6               211        4800
 8 Gentoo  Biscoe           46.7          15.3               219        5200
 9 Gentoo  Biscoe           43.3          13.4               209        4400
10 Gentoo  Biscoe           46.8          15.4               215        5150
# ℹ 114 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Wait! What the f* is `|>`

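This is called the native pipe operator.
|> lets you “pipe” an object forward to a function or call expression,
allowing you to express a sequence of operations that transform an object.
In RStudio, Ctrl + Shift + M inserts a pipe; to get |> instead of %>%, enable “Use native pipe operator, |>” under Tools $\rightarrow$ Global Options $\rightarrow$ Code.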
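A minimal sketch of the pipe with base R functions only (it assumes R 4.1 or newer, where `|>` was introduced). Both versions compute the same thing; the piped one reads from left to right.

```r
values <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- sum(sqrt(values))

# The pipe passes the left-hand result as the first argument of the next call
piped <- values |> sqrt() |> sum()

print(nested)
print(piped)
```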

`filter()` Function

Task
Code
Output

How do we get the data for penguins with a bill length of more than 43 mm?

penguins |> 
  filter(bill_length_mm > 43)

# A tibble: 188 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           46            21.5               194        4200
 2 Adelie  Dream               44.1          19.7               196        4400
 3 Adelie  Torgersen           45.8          18.9               197        4150
 4 Adelie  Dream               43.2          18.5               192        4100
 5 Adelie  Biscoe              43.2          19                 197        4775
 6 Adelie  Biscoe              45.6          20.3               191        4600
 7 Adelie  Torgersen           44.1          18                 210        4000
 8 Adelie  Torgersen           43.1          19.2               197        3500
 9 Gentoo  Biscoe              46.1          13.2               211        4500
10 Gentoo  Biscoe              50            16.3               230        5700
# ℹ 178 more rows
# ℹ 2 more variables: sex <fct>, year <int>

🧠 YOUR TURN

Task
Answer
Output

How do we get the data for only Adelie penguins?
How do we get the data for penguins with a bill depth of more than 10 mm?

# question number 1
penguins |> 
  filter(species == "Adelie")

# question number 2
penguins |> 
  filter(bill_depth_mm > 10)

# A tibble: 152 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 142 more rows
# ℹ 2 more variables: sex <fct>, year <int>

# A tibble: 342 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           36.7          19.3               193        3450
 5 Adelie  Torgersen           39.3          20.6               190        3650
 6 Adelie  Torgersen           38.9          17.8               181        3625
 7 Adelie  Torgersen           39.2          19.6               195        4675
 8 Adelie  Torgersen           34.1          18.1               193        3475
 9 Adelie  Torgersen           42            20.2               190        4250
10 Adelie  Torgersen           37.8          17.1               186        3300
# ℹ 332 more rows
# ℹ 2 more variables: sex <fct>, year <int>

`filter()` Function

Task
Code
Output

How do we get the data for Gentoo penguins with a bill length of more than 50 mm?

penguins |> 
  filter(species == "Gentoo",
         bill_length_mm > 50)

# A tibble: 22 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo  Biscoe           50.2          14.3               218        5700
 2 Gentoo  Biscoe           59.6          17                 230        6050
 3 Gentoo  Biscoe           50.5          15.9               222        5550
 4 Gentoo  Biscoe           50.5          15.9               225        5400
 5 Gentoo  Biscoe           50.1          15                 225        5000
 6 Gentoo  Biscoe           50.4          15.3               224        5550
 7 Gentoo  Biscoe           54.3          15.7               231        5650
 8 Gentoo  Biscoe           50.7          15                 223        5550
 9 Gentoo  Biscoe           51.1          16.3               220        6000
10 Gentoo  Biscoe           52.5          15.6               221        5450
# ℹ 12 more rows
# ℹ 2 more variables: sex <fct>, year <int>

`filter()` Function

Task
Code
Output

How to get the data of non-Gentoo penguins with a bill length of more than 50 mm and a weight of more than 4 kg?

penguins |> 
  filter(species != "Gentoo",
         bill_length_mm > 50,
         body_mass_g > 4000)

# A tibble: 11 × 8
   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
 1 Chinstrap Dream            52            18.1               201        4050
 2 Chinstrap Dream            50.5          19.6               201        4050
 3 Chinstrap Dream            52            19                 197        4150
 4 Chinstrap Dream            52.8          20                 205        4550
 5 Chinstrap Dream            54.2          20.8               201        4300
 6 Chinstrap Dream            51            18.8               203        4100
 7 Chinstrap Dream            52            20.7               210        4800
 8 Chinstrap Dream            53.5          19.9               205        4500
 9 Chinstrap Dream            50.8          18.5               201        4450
10 Chinstrap Dream            50.7          19.7               203        4050
11 Chinstrap Dream            50.8          19                 210        4100
# ℹ 2 more variables: sex <fct>, year <int>
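The commas inside `filter()` in the examples above combine conditions with AND. A minimal sketch of the other common combinations, using a small made-up tibble (`toy` and its values are hypothetical, not the real penguins data):

```r
library(dplyr)

# Hypothetical toy data standing in for the penguins dataset
toy <- tibble(
  species        = c("Adelie", "Gentoo", "Chinstrap", "Gentoo"),
  bill_length_mm = c(39.1, 50.2, 52.0, 46.1),
  body_mass_g    = c(3750, 5700, 4050, 4500)
)

# A comma (or &) means AND: every condition must be TRUE
both <- toy |>
  filter(species != "Gentoo", bill_length_mm > 50)

# | means OR: at least one condition must be TRUE
either <- toy |>
  filter(species == "Adelie" | body_mass_g > 5000)

# %in% keeps rows whose value matches any entry of a vector
two_species <- toy |>
  filter(species %in% c("Adelie", "Chinstrap"))
```

Note that `filter()` also drops rows where a condition evaluates to `NA`, so penguins with missing measurements never appear in these results.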

🧠 YOUR TURN

Task
Answer
Output

How to get the data of penguins only from Dream island that have a bill depth of more than 7 mm and a weight of more than 3 kg?

penguins |> 
  filter(island == "Dream",
         bill_depth_mm > 7,
         body_mass_g > 3000)

# A tibble: 118 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Adelie  Dream            39.5          16.7               178        3250
 2 Adelie  Dream            37.2          18.1               178        3900
 3 Adelie  Dream            39.5          17.8               188        3300
 4 Adelie  Dream            40.9          18.9               184        3900
 5 Adelie  Dream            36.4          17                 195        3325
 6 Adelie  Dream            39.2          21.1               196        4150
 7 Adelie  Dream            38.8          20                 190        3950
 8 Adelie  Dream            42.2          18.5               180        3550
 9 Adelie  Dream            37.6          19.3               181        3300
10 Adelie  Dream            39.8          19.1               184        4650
# ℹ 108 more rows
# ℹ 2 more variables: sex <fct>, year <int>


`select()` Function:

Picks variables/columns based on their names.

`select()` Function

Task
Code
Output

How to keep only the species variable in the data?

penguins |> 
  select(species)

# A tibble: 344 × 1
   species
   <fct>  
 1 Adelie 
 2 Adelie 
 3 Adelie 
 4 Adelie 
 5 Adelie 
 6 Adelie 
 7 Adelie 
 8 Adelie 
 9 Adelie 
10 Adelie 
# ℹ 334 more rows

`select()` Function

Task
Code
Output

How to keep only the bill-related variables in the data?

penguins |> 
  select(bill_length_mm, bill_depth_mm)

# A tibble: 344 × 2
   bill_length_mm bill_depth_mm
            <dbl>         <dbl>
 1           39.1          18.7
 2           39.5          17.4
 3           40.3          18  
 4           NA            NA  
 5           36.7          19.3
 6           39.3          20.6
 7           38.9          17.8
 8           39.2          19.6
 9           34.1          18.1
10           42            20.2
# ℹ 334 more rows
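When several variables share part of their name, typing each one can be avoided with selection helpers such as `starts_with()` and `ends_with()`. A small sketch on a made-up tibble (`toy` is hypothetical and only mimics the penguins column names):

```r
library(dplyr)

# Hypothetical toy data with penguins-like column names
toy <- tibble(
  species           = c("Adelie", "Gentoo"),
  bill_length_mm    = c(39.1, 50.2),
  bill_depth_mm     = c(18.7, 14.3),
  flipper_length_mm = c(181L, 218L)
)

# Keep every column whose name starts with "bill"
bills <- toy |>
  select(starts_with("bill"))

# Keep every column whose name ends with "_mm"
mm_cols <- toy |>
  select(ends_with("_mm"))

names(bills)
names(mm_cols)
```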

🧠 YOUR TURN

Task
Answer
Output

How to get the data with only the variables sex, year, island and flipper length?

penguins |> 
  select(sex, year, island, flipper_length_mm)

# A tibble: 344 × 4
   sex     year island    flipper_length_mm
   <fct>  <int> <fct>                 <int>
 1 male    2007 Torgersen               181
 2 female  2007 Torgersen               186
 3 female  2007 Torgersen               195
 4 <NA>    2007 Torgersen                NA
 5 female  2007 Torgersen               193
 6 male    2007 Torgersen               190
 7 female  2007 Torgersen               181
 8 male    2007 Torgersen               195
 9 <NA>    2007 Torgersen               193
10 <NA>    2007 Torgersen               190
# ℹ 334 more rows


💡 Tips for variable selection

Use the names() function to see the exact names and the order of the variables.
Use the : operator to select a range of variables.

penguins |> 
  select(island : flipper_length_mm)

Use the position (column number) of the variables.

penguins |> 
  select(3 : 7)

Use the - operator to exclude a range of variables.


penguins |> 
  select(-c(island : flipper_length_mm))

# A tibble: 344 × 4
   species body_mass_g sex     year
   <fct>         <int> <fct>  <int>
 1 Adelie         3750 male    2007
 2 Adelie         3800 female  2007
 3 Adelie         3250 female  2007
 4 Adelie           NA <NA>    2007
 5 Adelie         3450 female  2007
 6 Adelie         3650 male    2007
 7 Adelie         3625 female  2007
 8 Adelie         4675 male    2007
 9 Adelie         3475 <NA>    2007
10 Adelie         4250 <NA>    2007
# ℹ 334 more rows
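The tips above can be tried together on a tiny example. This is only a sketch: `toy` is a hypothetical one-row tibble, so the column positions differ from those in the real penguins data.

```r
library(dplyr)

# Hypothetical one-row tibble with a few penguins-like columns
toy <- tibble(
  species        = "Adelie",
  island         = "Torgersen",
  bill_length_mm = 39.1,
  bill_depth_mm  = 18.7,
  body_mass_g    = 3750L
)

# names() lists the column names in their current order
names(toy)

# Selecting a range by name and by position gives the same columns here
by_range    <- toy |> select(island:bill_depth_mm)
by_position <- toy |> select(2:4)

# A minus sign drops the range and keeps everything else
dropped <- toy |> select(-c(island:bill_depth_mm))
```

Selecting by name is usually safer than selecting by position, because positions change whenever columns are added or reordered.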

🧠 YOUR TURN

Task
Answer
Output

How to get the variables at positions one to five, but without the variable island?

penguins |> 
  select(c(1, 3:5))

# A tibble: 344 × 4
   species bill_length_mm bill_depth_mm flipper_length_mm
   <fct>            <dbl>         <dbl>             <int>
 1 Adelie            39.1          18.7               181
 2 Adelie            39.5          17.4               186
 3 Adelie            40.3          18                 195
 4 Adelie            NA            NA                  NA
 5 Adelie            36.7          19.3               193
 6 Adelie            39.3          20.6               190
 7 Adelie            38.9          17.8               181
 8 Adelie            39.2          19.6               195
 9 Adelie            34.1          18.1               193
10 Adelie            42            20.2               190
# ℹ 334 more rows


`mutate()` Function:

Adds new variables that are functions of existing variables.

`mutate()` Function

Task
Code
Output

How to convert body mass of penguins from grams to kilograms?

penguins |> 
  select(body_mass_g) |> 
  mutate(body_mass_kg = body_mass_g / 1000)

# A tibble: 344 × 2
   body_mass_g body_mass_kg
         <int>        <dbl>
 1        3750         3.75
 2        3800         3.8 
 3        3250         3.25
 4          NA        NA   
 5        3450         3.45
 6        3650         3.65
 7        3625         3.62
 8        4675         4.68
 9        3475         3.48
10        4250         4.25
# ℹ 334 more rows

`mutate()` Function

Task
Code
Output

How to measure the penguin’s bill size using length and depth?

penguins |> 
  mutate(bill_size = bill_length_mm * bill_depth_mm) |> 
  select(bill_size)

# A tibble: 344 × 1
   bill_size
       <dbl>
 1      731.
 2      687.
 3      725.
 4       NA 
 5      708.
 6      810.
 7      692.
 8      768.
 9      617.
10      848.
# ℹ 334 more rows
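A single `mutate()` call can create several variables at once, and a later variable can use one created earlier in the same call. A minimal sketch on made-up values (`toy`, `bill_ratio`, `big_bill` and the 730 cut-off are hypothetical, chosen only for illustration):

```r
library(dplyr)

# Hypothetical toy data with two penguins
toy <- tibble(
  bill_length_mm = c(39.1, 40.3),
  bill_depth_mm  = c(18.7, 18.0)
)

out <- toy |>
  mutate(
    bill_size  = bill_length_mm * bill_depth_mm,  # same proxy as in the slide
    bill_ratio = bill_length_mm / bill_depth_mm,  # a second new variable
    big_bill   = bill_size > 730                  # uses bill_size created above
  )

out
```

Giving every new variable a name on the left of `=` keeps the column headers readable.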

🧠 YOUR TURN

Task
Answer
Output

How to convert the bill dimensions from mm to cm?

penguins |> 
  select(bill_length_mm, bill_depth_mm) |> 
  mutate(bill_length_cm = bill_length_mm / 10,
         bill_depth_cm = bill_depth_mm / 10)

# A tibble: 344 × 4
   bill_length_mm bill_depth_mm bill_length_cm bill_depth_cm
            <dbl>         <dbl>          <dbl>         <dbl>
 1           39.1          18.7           3.91          1.87
 2           39.5          17.4           3.95          1.74
 3           40.3          18             4.03          1.8 
 4           NA            NA            NA            NA   
 5           36.7          19.3           3.67          1.93
 6           39.3          20.6           3.93          2.06
 7           38.9          17.8           3.89          1.78
 8           39.2          19.6           3.92          1.96
 9           34.1          18.1           3.41          1.81
10           42            20.2           4.2           2.02
# ℹ 334 more rows


`arrange()` Function:

Changes the ordering of the rows.

`arrange()` Function

Task
Code
Output

How to arrange the data by the bill length of the penguins?

penguins |> 
  arrange(bill_length_mm) # default is ascending order

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Dream               32.1          15.5               188        3050
 2 Adelie  Dream               33.1          16.1               178        2900
 3 Adelie  Torgersen           33.5          19                 190        3600
 4 Adelie  Dream               34            17.1               185        3400
 5 Adelie  Torgersen           34.1          18.1               193        3475
 6 Adelie  Torgersen           34.4          18.4               184        3325
 7 Adelie  Biscoe              34.5          18.1               187        2900
 8 Adelie  Torgersen           34.6          21.1               198        4400
 9 Adelie  Torgersen           34.6          17.2               189        3200
10 Adelie  Biscoe              35            17.9               190        3450
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

`arrange()` Function

Task
Code
Output

How to see the five penguins with the shortest bills?

penguins |> 
  arrange(bill_length_mm) |> 
  head(5) 

# use tail() to see the bottom rows of the data

# A tibble: 5 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Dream               32.1          15.5               188        3050
2 Adelie  Dream               33.1          16.1               178        2900
3 Adelie  Torgersen           33.5          19                 190        3600
4 Adelie  Dream               34            17.1               185        3400
5 Adelie  Torgersen           34.1          18.1               193        3475
# ℹ 2 more variables: sex <fct>, year <int>

🧠 YOUR TURN

Task
Answer
Output

How to see the five penguins with the longest bills?

penguins |> 
  arrange(bill_length_mm) |> 
  tail(5)

# A tibble: 5 × 8
  species   island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>     <fct>              <dbl>         <dbl>             <int>       <int>
1 Gentoo    Biscoe              55.9          17                 228        5600
2 Chinstrap Dream               58            17.8               181        3700
3 Gentoo    Biscoe              59.6          17                 230        6050
4 Adelie    Torgersen           NA            NA                  NA          NA
5 Gentoo    Biscoe              NA            NA                  NA          NA
# ℹ 2 more variables: sex <fct>, year <int>
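In the output above, the last two rows are `NA` because `arrange()` always places missing values at the end, so `tail(5)` returns only three measured penguins. One way around this is to sort in descending order with `desc()` and take the top rows with `head()`. A minimal sketch on made-up data (`toy` and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data with one missing bill length
toy <- tibble(
  id             = 1:5,
  bill_length_mm = c(39.1, NA, 59.6, 46.1, 58.0)
)

# Ascending sort: the NA row ends up last, so tail() picks it up
last_row <- toy |>
  arrange(bill_length_mm) |>
  tail(1)

# Descending sort: the longest bills come first, the NA row is still last
longest <- toy |>
  arrange(desc(bill_length_mm)) |>
  head(2)

longest
```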


`summarise()` Function:

Reduces multiple values down to a single summary.

`summarise()` Function

Task
Code
Output

What is the mean bill length of penguins?

penguins |> 
  summarise(mean(bill_length_mm))

# A tibble: 1 × 1
  `mean(bill_length_mm)`
                   <dbl>
1                     NA

`summarise()` Function

Task
Code
Output

What is the mean bill length of penguins after removing the missing values?

penguins |>
  drop_na() |> 
  summarise(mean(bill_length_mm))

# A tibble: 1 × 1
  `mean(bill_length_mm)`
                   <dbl>
1                   44.0
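`drop_na()` without arguments removes every row that has a missing value in any column, not only in the column being summarised. An alternative is `mean(..., na.rm = TRUE)`, which ignores only the missing values of that one variable, so the two approaches can give slightly different answers. A sketch on made-up data (`toy` and its values are hypothetical):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one missing bill length and one missing sex
toy <- tibble(
  bill_length_mm = c(40, 44, NA, 60),
  sex            = c("male", NA, "female", "female")
)

# na.rm = TRUE skips only the missing bill length: mean of 40, 44 and 60
with_na_rm <- toy |>
  summarise(mean_bill = mean(bill_length_mm, na.rm = TRUE))

# drop_na() also drops the row with a missing sex: mean of 40 and 60
with_drop_na <- toy |>
  drop_na() |>
  summarise(mean_bill = mean(bill_length_mm))

with_na_rm$mean_bill    # 48
with_drop_na$mean_bill  # 50
```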

`summarise()` Function

Task
Code
Output

What is the species-wise mean bill length of penguins?

penguins |>
  drop_na() |> 
  group_by(species) |> 
  summarise(mean(bill_length_mm))

# A tibble: 3 × 2
  species   `mean(bill_length_mm)`
  <fct>                      <dbl>
1 Adelie                      38.8
2 Chinstrap                   48.8
3 Gentoo                      47.6

`summarise()` Function

Task
Code
Output

What is the species-wise mean bill length of penguins and the total number of penguins in each species?

penguins |>
  drop_na() |> 
  group_by(species) |> 
  summarise(mean(bill_length_mm),
            n = n())

# n() gives the number of observations in the current group

# A tibble: 3 × 3
  species   `mean(bill_length_mm)`     n
  <fct>                      <dbl> <int>
1 Adelie                      38.8   146
2 Chinstrap                   48.8    68
3 Gentoo                      47.6   119
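The column header `mean(bill_length_mm)` appears because the summary was not given a name. Naming each summary inside `summarise()` gives cleaner headers that are easier to use in later steps. A minimal sketch on made-up data (`toy`, its values and the name `mean_bill` are hypothetical):

```r
library(dplyr)

# Hypothetical toy data with two species
toy <- tibble(
  species        = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  bill_length_mm = c(38, 40, 46, 48, 50)
)

by_species <- toy |>
  group_by(species) |>
  summarise(
    mean_bill = mean(bill_length_mm),  # named summary column
    n         = n()                    # number of rows in each group
  )

by_species
```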

🧠 YOUR TURN

Task
Answer
Output

Which weigh more on average: male or female penguins?

penguins |>
  drop_na() |> 
  group_by(sex) |> 
  summarise(mean(body_mass_g),
            n = n())

# A tibble: 2 × 3
  sex    `mean(body_mass_g)`     n
  <fct>                <dbl> <int>
1 female               3862.   165
2 male                 4546.   168


References

Title slide background image is from Joanna Kosinska.
R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund.
R bloggers https://www.r-bloggers.com/
The R Project for Statistical Computing https://www.r-project.org/
posit (earlier RStudio) https://posit.co/
R packages for data science https://www.tidyverse.org/

Thank
You
