Dr. Ajay Kumar Koli, PhD · SARA Institute of Data Science · India
Hello Everyone!!
SARA stands for Savitribai Ramabai Institute of Data Science.
It is a Charitable Education Trust, established in 2023 by Dr. Ajay Kumar Koli & Dr. Kiran Lata Koli.
Our mission is to enhance the representation of under-privileged communities in the field of data science.
To share reasons with you (and hopefully convince you) why you should learn data science tools like R, Python, GitHub, Quarto, etc.
We think we work like this but …
This is what our work actually looks like.
MS Word
Excel
PPTs
Other Tools:
SPSS
SAS
STATA
Reference management
👀 Focus on Content.
💅 Set Yourself Apart.
“Data science is an exciting discipline that allows you to transform raw data into understanding, insight, and knowledge.”
“You don’t need to be an expert programmer to be a successful data scientist.”
R and RStudio
“R is a free software environment for statistical computing and graphics.”
Initially developed as the S language at Bell Labs.
First appeared in August 1993 as the R language, created by:
Ross Ihaka
(New Zealand Statistician)
Robert Gentleman
(Canadian Statistician)
Download R from CRAN
R version 4.3.1 (2023-06-16)
R name “Beagle Scouts”
R licence “ABSOLUTELY NO WARRANTY”
R prompt >|
Don’t save workspace image.
It helps in “freshly minted R sessions”.
“put more trust in your script than in your memory”.
OPERATORS
“Operators are used to perform operations on variables and values.”
12 + 3
In this code, + is an operator.
Tip
Put spaces around operators (=, +, -, *, /).
Arithmetic operators are used with numeric values to perform common mathematical operations:
Operator | Name | Example |
---|---|---|
+ | Addition | x + y |
- | Subtraction | x - y |
* | Multiplication | x * y |
/ | Division | x / y |
^ | Exponent | x ^ y |
7
[1] 7
2 + 1
[1] 3
10 - 2
[1] 8
12 * 4
[1] 48
25 / 5
[1] 5
Comparison operators are used to compare two values:
Operator | Name | Example |
---|---|---|
== | Equal | x == y |
!= | Not equal | x != y |
> | Greater than | x > y |
< | Less than | x < y |
>= | Greater than or equal to | x >= y |
<= | Less than or equal to | x <= y |
4 == 5
[1] FALSE
67 > 60
[1] TRUE
3434 + 343453 * 2323 / 534 - 1000
[1] 1496519
Important
R follows the BODMAS (brackets, orders, division, multiplication, addition and subtraction) rule when evaluating mathematical expressions.
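A small sketch of the BODMAS rule in action. The numbers and object names here are made up for illustration; they are not from the original slides.

```r
# Multiplication is done before addition
no_brackets <- 2 + 3 * 4
no_brackets        # 14

# Brackets are evaluated first
with_brackets <- (2 + 3) * 4
with_brackets      # 20

# Orders (exponents) come before multiplication
with_power <- 2 * 3 ^ 2
with_power         # 18

# Division and multiplication share a level and run left to right
left_to_right <- 100 / 10 * 2
left_to_right      # 20
```

When in doubt, add brackets: they cost nothing and make the intended order visible to the reader.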
12:18
[1] 12 13 14 15 16 17 18
Important
R miscellaneous operators: miscellaneous operators are used to manipulate data.
Operator | Description | Example |
---|---|---|
: | Creates a series of numbers in a sequence | 1:10 |
combine plot, text, tables and images in a single file.
publish my work online or convert into a word, pdf or html file.
work efficiently with my different projects and save, share and track them.
🔥 WE NEED A SUPERHERO 🔥
S T U D I O
RStudio is an integrated development environment (IDE) for R and Python.
According to Posit, RStudio is “the most trusted IDE for open source data science”.
Download RStudio.
“It includes a console, syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging, and workspace management.”
PROJECT
⛷️ Create
RStudio Project
in 4 Steps ⛷️
Artwork by Allison Horst
R Console
R script using RStudio.
Quarto document using RStudio
Write code in the R script → the Console will show the results.
Writing readable code because other people might need to use your code.
Writing readable code because you might need to use your code, a few weeks/months/years after you’ve written it.
Put spaces around variable names and operators (=, +, -, *, /).
Break up long lines of code.
Keep a consistent style.
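To make these style tips concrete, here is a rough before-and-after sketch. The weights come from the example data used later in this deck; the heights and all object names are invented for illustration.

```r
# Hard to read: no spaces, everything on one line
bmi_cramped<-c(50,52,61)/(c(1.60,1.65,1.70)^2)

# Easier to read: named objects, spaces around operators, short lines
weight_kg <- c(50, 52, 61)
height_m  <- c(1.60, 1.65, 1.70)   # made-up heights in metres

bmi_readable <- weight_kg / height_m ^ 2
bmi_readable
```

Both versions compute the same numbers; only the second one is pleasant to come back to after a few months.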
FUNCTION
Artwork by Allison Horst
“A function, in a programming environment, is a set of instructions.”
“A programmer builds a function to avoid repeating the same task, or reduce complexity.”
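As a minimal sketch of this idea, the hypothetical function below (grams_to_kg is a name made up for this example) wraps one small task so that it never has to be retyped.

```r
# A small function: convert a weight from grams to kilograms
grams_to_kg <- function(grams) {
  grams / 1000
}

# Use it on a single value
grams_to_kg(3750)            # 3.75

# Use it on several values at once
grams_to_kg(c(3800, 3250))   # 3.80 3.25
```

Once the function exists, changing the conversion in one place updates every call that uses it.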
COMMENT
Artwork by Allison Horst
“Humans will be able to read the comments, but your computer will pass over them.”
In R, # is used as the comment symbol.
PACKAGES
“An R package is a collection of functions, data, and documentation that extends the capabilities of base R. Using packages is key to the successful use of R.”
install.packages("tidyverse")
You need to install a package only once, just like:
📚 We buy books once and use them again and again
💡 Fix the bulb once and use it again and again.
In every R document, you need to load the package once using the library() function, for example library(ggplot2).
Once in a while, you need to update the installed packages as well.
If you uninstall R, you may lose your installed packages; uninstalling RStudio alone does not remove them.
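One possible way to express "install once, load every time" in code. The helper name ensure_installed is hypothetical (it is not part of R or any package); it only calls install.packages() when the package is missing.

```r
# Hypothetical helper: install a package only if it is not already installed
ensure_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # runs only when the package is not found
  }
  invisible(pkg)
}

# "stats" ships with R, so nothing is downloaded here;
# swap in "tidyverse" or any other package name as needed
ensure_installed("stats")

# Loading is still needed in every new R session or document
library(stats)
```

The check keeps a script safe to rerun: the second time it runs, the install step is skipped.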
OBJECTS
Artwork by Allison Horst
Just a name that you can use to call up stored data.
Important
R assignment operators: Assignment operators are used to assign values to variables.
my_var <- 3
my_var # print my_var
a name cannot start with a number.
a name cannot use some special symbols, like ^, !, $, @, +, -, /, *, or :.
avoid caps.
avoid space.
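in file names, use a dash (like weight-kg) or an underscore (like weight_kg); in object names, use only the underscore, because - is the subtraction operator.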
if chronology matters then add date (2020-09-05-file-name).
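A quick way to check whether a string would work as an R object name, sketched with base R's make.names(). The helper is_valid_name is a made-up name for this example.

```r
# make.names() rewrites a string into a syntactically valid R name.
# If the string comes back unchanged, it was already a valid name.
is_valid_name <- function(x) {
  make.names(x) == x
}

is_valid_name("weight_kg")     # TRUE: letters and an underscore
is_valid_name("weight-kg")     # FALSE: - is the subtraction operator
is_valid_name("2020_income")   # FALSE: starts with a number
is_valid_name("income$")       # FALSE: $ is a special symbol
```

This checks object names only; file names follow looser rules, which is why a dash is fine there.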
🤔 How to combine all these objects and form a data set?
name income age place weight_kg
1 Bhim 23000 23 MH 50
2 Rama 25000 25 RJ 52
3 Sara 16000 16 DL 61
4 Phule 4000 40 HR 40
5 Savitri 34000 34 HR 70
example_df <- data.frame(name, income, age, place, weight_kg)
example_df
name income age place weight_kg
1 Bhim 23000 23 MH 50
2 Rama 25000 25 RJ 52
3 Sara 16000 16 DL 61
4 Phule 4000 40 HR 40
5 Savitri 34000 34 HR 70
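The data.frame() call above needs the five objects to exist first. A sketch of how they could be created with c(), using the values shown in the printed table:

```r
# One vector per column; every vector has five values, in the same order
name      <- c("Bhim", "Rama", "Sara", "Phule", "Savitri")
income    <- c(23000, 25000, 16000, 4000, 34000)
age       <- c(23, 25, 16, 40, 34)
place     <- c("MH", "RJ", "DL", "HR", "HR")
weight_kg <- c(50, 52, 61, 40, 70)

# Combine the vectors column-wise into a data set
example_df <- data.frame(name, income, age, place, weight_kg)

# Check the structure: 5 observations of 5 variables
str(example_df)
```

All vectors must have the same length, otherwise data.frame() stops with an error.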
csv File
COMMUNITY
Artwork by Allison Horst
In the console, at the > prompt, type ? followed by your query, for example ?mean.
QUARTO
“An open-source scientific and technical publishing system”
Quarto can produce a wide variety of output formats:
Articles & Reports
Presentations
Interactive Docs
Websites
Books
Quarto sends the .qmd file to knitr, which executes all of the code chunks and creates a new markdown (.md) document which includes the code and its output. The markdown file generated by knitr is then processed by pandoc, which is responsible for creating the finished file.
MARKDOWN
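Markdown Syntax | Output |
---|---|
`normal` | normal |
`*italics*` | italics |
`**bold**` | bold |
`***bold italics***` | bold italics |
Markdown Syntax | Output |
---|---|
`superscript^2^` | superscript² |
`subscript~2~` | subscript₂ |
`` `verbatim code` `` | verbatim code |
Markdown Syntax | Output |
---|---|
`# Header 1` | Header 1 |
`## Header 2` | Header 2 |
`### Header 3` | Header 3 |
`#### Header 4` | Header 4 |
`##### Header 5` | Header 5 |
`###### Header 6` | Header 6 |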
Markdown syntax | Output |
---|---|
`<https://saraedu.netlify.app/>` | https://saraedu.netlify.app/ |
Markdown syntax | Output |
---|---|
`[SARA](add link here)` | SARA |
If the image is saved on your computer,
![](add image path here)
If the image is taken from the internet,
![](add image link here)
Use $ delimiters for inline math.
Use $$ delimiters for display math.
`r `
`r 1+1`
2
You can include videos in documents using the {{< video >}} shortcode.
| Right | Left | Default | Center |
|------:|:-----|---------|:------:|
| 12 | 12 | 12 | 12 |
| 123 | 123 | 123 | 123 |
| 1 | 1 | 1 | 1 |
Right | Left | Default | Center |
---|---|---|---|
12 | 12 | 12 | 12 |
123 | 123 | 123 | 123 |
1 | 1 | 1 | 1 |
Know Your Data
“Happy families are all alike; every unhappy family is unhappy in its own way.” — Leo Tolstoy
name income age place weight_kg
1 Bhim 23000 23 MH 50
2 Rama 25000 25 RJ 52
3 Sara 16000 16 DL 61
4 Phule 4000 40 HR 40
5 Savitri 34000 34 HR 70
chr = character, example "A"
int = integer, example 1L
dbl = double, example 1.5
lgl = logical, example TRUE
fct = factor, example factor("A")
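These types can be checked directly in the console. A small sketch with base R, using the example values from the list above; the object name types is made up.

```r
# typeof() reports how R stores a value; class() is used for the factor,
# because a factor is stored as integer codes with labels on top
types <- c(
  chr = typeof("A"),
  int = typeof(1L),
  dbl = typeof(1.5),
  lgl = typeof(TRUE),
  fct = class(factor("A"))
)
types
```

Note that 1 without the L suffix is a double, not an integer.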
summary(example_df)
name income age place
Length:5 Min. : 4000 Min. :16.0 Length:5
Class :character 1st Qu.:16000 1st Qu.:23.0 Class :character
Mode :character Median :23000 Median :25.0 Mode :character
Mean :20400 Mean :27.6
3rd Qu.:25000 3rd Qu.:34.0
Max. :34000 Max. :40.0
weight_kg
Min. :40.0
1st Qu.:50.0
Median :52.0
Mean :54.6
3rd Qu.:61.0
Max. :70.0
Name | example_df |
Number of rows | 5 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
character | 2 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
name | 0 | 1 | 4 | 7 | 0 | 5 | 0 |
place | 0 | 1 | 2 | 2 | 0 | 4 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
income | 0 | 1 | 20400.0 | 11193.75 | 4000 | 16000 | 23000 | 25000 | 34000 | ▃▃▁▇▃ |
age | 0 | 1 | 27.6 | 9.45 | 16 | 23 | 25 | 34 | 40 | ▃▇▁▃▃ |
weight_kg | 0 | 1 | 54.6 | 11.39 | 40 | 50 | 52 | 61 | 70 | ▃▇▁▃▃ |
The R package is named palmerpenguins and the dataset is named penguins.
More information here.
library(tidyverse)       # provides glimpse(), ggplot() and the dplyr verbs
library(palmerpenguins)  # do not forget to load the package
glimpse(penguins)
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
How to summarize the penguins data using the summary(), skim(), & str() functions?
“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey
How to visualize data using the R package ggplot2.
Every ggplot2 plot has three key components:
Data,
A set of aesthetic mappings between variables in the data and visual properties, and
At least one layer which describes how to render each observation. Layers are usually created with a geom function.
Make sure that every ( is matched with a ) and every " is paired with another ".
If the Console shows no results but a + sign, your code is incomplete and R is waiting for you to complete it.
In ggplot2, + has to come at the end of the line, not the start.
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm)) +
  geom_histogram(fill = "darkblue",
                 color = "white")
ggplot(data = penguins,
       mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(size = 2, shape = 23, color = "red", fill = "gold")
Title of the plot
Subtitle of the plot with more information
Title of the x-axis
Title of the y-axis
Each level of the factor/category can be shown using a different shape or a different color.
The R package ggthemes provides functions for colorblind-friendly color schemes.
library(ggthemes)  # provides theme_clean() and scale_color_colorblind()
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  labs(
    title = "The title of the plot",
    subtitle = "The subtitle of the plot",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  ) +
  theme_clean() +
  scale_color_colorblind()
library(RColorBrewer)
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  labs(
    title = "The title of the plot",
    subtitle = "The subtitle of the plot",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  ) +
  theme_clean() +
  scale_color_brewer(palette = "Dark2")
library(wesanderson)
names(wes_palettes)
[1] "BottleRocket1" "BottleRocket2" "Rushmore1"
[4] "Rushmore" "Royal1" "Royal2"
[7] "Zissou1" "Zissou1Continuous" "Darjeeling1"
[10] "Darjeeling2" "Chevalier1" "FantasticFox1"
[13] "Moonrise1" "Moonrise2" "Moonrise3"
[16] "Cavalcanti1" "GrandBudapest1" "GrandBudapest2"
[19] "IsleofDogs1" "IsleofDogs2" "FrenchDispatch"
[22] "AsteroidCity1" "AsteroidCity2" "AsteroidCity3"
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  labs(
    title = "The title of the plot",
    subtitle = "The subtitle of the plot",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  ) +
  theme_clean() +
  scale_color_manual(values = wes_palette("BottleRocket2", n = 3))
Export/save plot as pdf, jpg or png file.
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(aes(color = species, shape = species)) +
  labs(
    title = "The title of the plot",
    subtitle = "The subtitle of the plot",
    x = "Bill length (mm)",
    y = "Bill depth (mm)"
  ) +
  theme_clean() +
  scale_color_manual(values = wes_palette("BottleRocket2", n = 3))
ggsave("penguins-plot.pdf")
🧑🏽‍💻👨🏽‍💻
Question & Answer
DATA WRANGLING
“data exploration and data manipulation” by Jesse Mostipak
“tidying and transforming” by Hadley & Garrett
“narrowing in on observations of interest …
creating new variables that are functions of existing variables and …
calculating a set of summary statistics.”
R Package dplyr
dplyr Package
“dplyr is a grammar of data manipulation”
“providing a consistent set of verbs that help you solve the most common data manipulation challenges:”
dplyr Functions:
filter() Function: Picks cases/observations based on their values.
filter() Function
How to get a data set of only Gentoo penguins?
# A tibble: 124 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Gentoo Biscoe 46.1 13.2 211 4500
2 Gentoo Biscoe 50 16.3 230 5700
3 Gentoo Biscoe 48.7 14.1 210 4450
4 Gentoo Biscoe 50 15.2 218 5700
5 Gentoo Biscoe 47.6 14.5 215 5400
6 Gentoo Biscoe 46.5 13.5 210 4550
7 Gentoo Biscoe 45.4 14.6 211 4800
8 Gentoo Biscoe 46.7 15.3 219 5200
9 Gentoo Biscoe 43.3 13.4 209 4400
10 Gentoo Biscoe 46.8 15.4 215 5150
# ℹ 114 more rows
# ℹ 2 more variables: sex <fct>, year <int>
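One way that would produce a table like the one above, assuming the dplyr and palmerpenguins packages are installed. The object name gentoo is made up for this sketch.

```r
library(dplyr)
library(palmerpenguins)

# Keep only the rows where species is exactly "Gentoo"
gentoo <- penguins |>
  filter(species == "Gentoo")

gentoo
```

Note the double equals sign: == compares values, while a single = would try to assign.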
|>
This is called the native pipe operator.
|> lets you “pipe” an object forward to a function or call expression, allowing you to express a sequence of operations that transform an object.
Ctrl + Shift + M = |> (in RStudio, with “Use native pipe operator” enabled under Tools > Global Options > Code)
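A tiny sketch of what the pipe does, using only base R (the native pipe needs R 4.1 or newer). The numbers and the object name piped are invented for illustration.

```r
# Without the pipe: read from the inside out
sum(sqrt(c(1, 4, 9)))

# With the native pipe: read from top to bottom
piped <- c(1, 4, 9) |>
  sqrt() |>
  sum()

piped   # 6
```

Both forms give the same result; the piped form lists the steps in the order they happen.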
filter() Function
How to get a data set of penguins with bill length more than 43 mm?
penguins |>
  filter(bill_length_mm > 43)
# A tibble: 188 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 46 21.5 194 4200
2 Adelie Dream 44.1 19.7 196 4400
3 Adelie Torgersen 45.8 18.9 197 4150
4 Adelie Dream 43.2 18.5 192 4100
5 Adelie Biscoe 43.2 19 197 4775
6 Adelie Biscoe 45.6 20.3 191 4600
7 Adelie Torgersen 44.1 18 210 4000
8 Adelie Torgersen 43.1 19.2 197 3500
9 Gentoo Biscoe 46.1 13.2 211 4500
10 Gentoo Biscoe 50 16.3 230 5700
# ℹ 178 more rows
# ℹ 2 more variables: sex <fct>, year <int>
How to get a data set of only Adelie penguins?
How to get a data set of penguins with bill depth more than 10 mm?
# A tibble: 152 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 142 more rows
# ℹ 2 more variables: sex <fct>, year <int>
# A tibble: 342 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen 36.7 19.3 193 3450
5 Adelie Torgersen 39.3 20.6 190 3650
6 Adelie Torgersen 38.9 17.8 181 3625
7 Adelie Torgersen 39.2 19.6 195 4675
8 Adelie Torgersen 34.1 18.1 193 3475
9 Adelie Torgersen 42 20.2 190 4250
10 Adelie Torgersen 37.8 17.1 186 3300
# ℹ 332 more rows
# ℹ 2 more variables: sex <fct>, year <int>
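A possible solution sketch for the two questions above, assuming dplyr and palmerpenguins are installed. The object names adelie and deep_bills are made up.

```r
library(dplyr)
library(palmerpenguins)

# Only Adelie penguins
adelie <- penguins |>
  filter(species == "Adelie")

# Only penguins with bill depth above 10 mm
deep_bills <- penguins |>
  filter(bill_depth_mm > 10)

nrow(adelie)       # 152
nrow(deep_bills)   # 342
```

filter() drops rows where the condition is NA, which is why the second result has 342 rows instead of 344: two penguins have no bill measurements.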
filter() Function
How to get a data set of Gentoo penguins with bill length more than 50 mm?
penguins |>
  filter(species == "Gentoo",
         bill_length_mm > 50)
# A tibble: 22 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Gentoo Biscoe 50.2 14.3 218 5700
2 Gentoo Biscoe 59.6 17 230 6050
3 Gentoo Biscoe 50.5 15.9 222 5550
4 Gentoo Biscoe 50.5 15.9 225 5400
5 Gentoo Biscoe 50.1 15 225 5000
6 Gentoo Biscoe 50.4 15.3 224 5550
7 Gentoo Biscoe 54.3 15.7 231 5650
8 Gentoo Biscoe 50.7 15 223 5550
9 Gentoo Biscoe 51.1 16.3 220 6000
10 Gentoo Biscoe 52.5 15.6 221 5450
# ℹ 12 more rows
# ℹ 2 more variables: sex <fct>, year <int>
filter() Function
How to get a data set of non-Gentoo penguins with bill length more than 50 mm and weight more than 4 kg?
penguins |>
  filter(species != "Gentoo",
         bill_length_mm > 50,
         body_mass_g > 4000)
# A tibble: 11 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Chinstrap Dream 52 18.1 201 4050
2 Chinstrap Dream 50.5 19.6 201 4050
3 Chinstrap Dream 52 19 197 4150
4 Chinstrap Dream 52.8 20 205 4550
5 Chinstrap Dream 54.2 20.8 201 4300
6 Chinstrap Dream 51 18.8 203 4100
7 Chinstrap Dream 52 20.7 210 4800
8 Chinstrap Dream 53.5 19.9 205 4500
9 Chinstrap Dream 50.8 18.5 201 4450
10 Chinstrap Dream 50.7 19.7 203 4050
11 Chinstrap Dream 50.8 19 210 4100
# ℹ 2 more variables: sex <fct>, year <int>
How to get a data set of penguins only from Dream island that have bill depth more than 7 mm and weight more than 3 kg?
penguins |>
  filter(island == "Dream",
         bill_depth_mm > 7,
         body_mass_g > 3000)
# A tibble: 118 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Dream 39.5 16.7 178 3250
2 Adelie Dream 37.2 18.1 178 3900
3 Adelie Dream 39.5 17.8 188 3300
4 Adelie Dream 40.9 18.9 184 3900
5 Adelie Dream 36.4 17 195 3325
6 Adelie Dream 39.2 21.1 196 4150
7 Adelie Dream 38.8 20 190 3950
8 Adelie Dream 42.2 18.5 180 3550
9 Adelie Dream 37.6 19.3 181 3300
10 Adelie Dream 39.8 19.1 184 4650
# ℹ 108 more rows
# ℹ 2 more variables: sex <fct>, year <int>
select() Function: Picks variables/columns based on their names.
select() Function
How to keep only bill-related variables in the data?
penguins |>
  select(bill_length_mm, bill_depth_mm)
# A tibble: 344 × 2
bill_length_mm bill_depth_mm
<dbl> <dbl>
1 39.1 18.7
2 39.5 17.4
3 40.3 18
4 NA NA
5 36.7 19.3
6 39.3 20.6
7 38.9 17.8
8 39.2 19.6
9 34.1 18.1
10 42 20.2
# ℹ 334 more rows
How to get a data set with the variables sex, year, island and flipper length?
penguins |>
  select(sex, year, island, flipper_length_mm)
# A tibble: 344 × 4
sex year island flipper_length_mm
<fct> <int> <fct> <int>
1 male 2007 Torgersen 181
2 female 2007 Torgersen 186
3 female 2007 Torgersen 195
4 <NA> 2007 Torgersen NA
5 female 2007 Torgersen 193
6 male 2007 Torgersen 190
7 female 2007 Torgersen 181
8 male 2007 Torgersen 195
9 <NA> 2007 Torgersen 193
10 <NA> 2007 Torgersen 190
# ℹ 334 more rows
Use the names() function to see the exact names and the order of the variables.
Use the : operator to select a range of variables.
penguins |>
  select(island : flipper_length_mm)
penguins |>
  select(3 : 7)
Use the - operator to exclude a range of variables.
# A tibble: 344 × 4
species body_mass_g sex year
<fct> <int> <fct> <int>
1 Adelie 3750 male 2007
2 Adelie 3800 female 2007
3 Adelie 3250 female 2007
4 Adelie NA <NA> 2007
5 Adelie 3450 female 2007
6 Adelie 3650 male 2007
7 Adelie 3625 female 2007
8 Adelie 4675 male 2007
9 Adelie 3475 <NA> 2007
10 Adelie 4250 <NA> 2007
# ℹ 334 more rows
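A sketch of a call that would give the four columns shown above, assuming dplyr and palmerpenguins are installed. The object name without_range is made up.

```r
library(dplyr)
library(palmerpenguins)

# Drop every column from island to flipper_length_mm;
# the brackets make the minus sign apply to the whole range
without_range <- penguins |>
  select(-(island : flipper_length_mm))

names(without_range)
```

Without the brackets, the minus sign would apply to island only and the range would mean something else.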
How to get the variables in positions one to five, but without the variable island?
# A tibble: 344 × 4
species bill_length_mm bill_depth_mm flipper_length_mm
<fct> <dbl> <dbl> <int>
1 Adelie 39.1 18.7 181
2 Adelie 39.5 17.4 186
3 Adelie 40.3 18 195
4 Adelie NA NA NA
5 Adelie 36.7 19.3 193
6 Adelie 39.3 20.6 190
7 Adelie 38.9 17.8 181
8 Adelie 39.2 19.6 195
9 Adelie 34.1 18.1 193
10 Adelie 42 20.2 190
# ℹ 334 more rows
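One possible answer, assuming dplyr and palmerpenguins are installed: combine a range of positions with a minus sign. The object name first_five_no_island is made up.

```r
library(dplyr)
library(palmerpenguins)

# Take columns 1 to 5, then remove island from that selection
first_five_no_island <- penguins |>
  select(1 : 5, -island)

names(first_five_no_island)
```

The selections are applied in order, so island is first included by 1 : 5 and then taken out again.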
mutate() Function: Adds new variables that are functions of existing variables.
mutate() Function
How to convert body mass of penguins from grams to kilograms?
# A tibble: 344 × 2
body_mass_g body_mass_kg
<int> <dbl>
1 3750 3.75
2 3800 3.8
3 3250 3.25
4 NA NA
5 3450 3.45
6 3650 3.65
7 3625 3.62
8 4675 4.68
9 3475 3.48
10 4250 4.25
# ℹ 334 more rows
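A sketch that would give a two-column table like the one above, assuming dplyr and palmerpenguins are installed. The object name mass_kg is made up; keeping only body_mass_g first is an assumption made to match the printed output.

```r
library(dplyr)
library(palmerpenguins)

mass_kg <- penguins |>
  select(body_mass_g) |>                      # keep one column for a compact view
  mutate(body_mass_kg = body_mass_g / 1000)   # 1000 grams = 1 kilogram

mass_kg
```

mutate() adds the new column on the right and leaves the original column untouched.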
mutate() Function
How to convert the bill dimensions from mm to cm?
# A tibble: 344 × 4
bill_length_mm bill_depth_mm bill_length_cm `bill_depth_mm/10`
<dbl> <dbl> <dbl> <dbl>
1 39.1 18.7 3.91 1.87
2 39.5 17.4 3.95 1.74
3 40.3 18 4.03 1.8
4 NA NA NA NA
5 36.7 19.3 3.67 1.93
6 39.3 20.6 3.93 2.06
7 38.9 17.8 3.89 1.78
8 39.2 19.6 3.92 1.96
9 34.1 18.1 3.41 1.81
10 42 20.2 4.2 2.02
# ℹ 334 more rows
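In the output above, the last column is labelled `bill_depth_mm/10`, which suggests the second new variable was left unnamed. A sketch that names both new columns, assuming dplyr and palmerpenguins are installed (bill_cm and bill_depth_cm are made-up names):

```r
library(dplyr)
library(palmerpenguins)

bill_cm <- penguins |>
  select(bill_length_mm, bill_depth_mm) |>
  mutate(
    bill_length_cm = bill_length_mm / 10,   # 10 mm = 1 cm
    bill_depth_cm  = bill_depth_mm / 10     # give every new column a name
  )

bill_cm
```

Naming each new variable keeps the column headers short and easy to reuse in later steps.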
arrange() Function: Changes the ordering of the rows.
arrange() Function
How to arrange the data by the bill length of the penguins?
penguins |>
  arrange(bill_length_mm) # default is ascending order
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Dream 32.1 15.5 188 3050
2 Adelie Dream 33.1 16.1 178 2900
3 Adelie Torgersen 33.5 19 190 3600
4 Adelie Dream 34 17.1 185 3400
5 Adelie Torgersen 34.1 18.1 193 3475
6 Adelie Torgersen 34.4 18.4 184 3325
7 Adelie Biscoe 34.5 18.1 187 2900
8 Adelie Torgersen 34.6 21.1 198 4400
9 Adelie Torgersen 34.6 17.2 189 3200
10 Adelie Biscoe 35 17.9 190 3450
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
arrange() Function
How to see the five penguins with the shortest bills?
# A tibble: 5 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Dream 32.1 15.5 188 3050
2 Adelie Dream 33.1 16.1 178 2900
3 Adelie Torgersen 33.5 19 190 3600
4 Adelie Dream 34 17.1 185 3400
5 Adelie Torgersen 34.1 18.1 193 3475
# ℹ 2 more variables: sex <fct>, year <int>
How to see the five penguins with the longest bills?
# A tibble: 5 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Gentoo Biscoe 55.9 17 228 5600
2 Chinstrap Dream 58 17.8 181 3700
3 Gentoo Biscoe 59.6 17 230 6050
4 Adelie Torgersen NA NA NA NA
5 Gentoo Biscoe NA NA NA NA
# ℹ 2 more variables: sex <fct>, year <int>
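A sketch of calls that would give the two tables above, assuming dplyr and palmerpenguins are installed. All object names are made up. arrange() always places missing values last, so taking the tail of an ascending sort picks up the two NA rows; the last step shows one alternative that avoids this.

```r
library(dplyr)
library(palmerpenguins)

by_bill <- penguins |>
  arrange(bill_length_mm)     # ascending; rows with NA go to the end

shortest_five <- by_bill |> head(5)
last_five     <- by_bill |> tail(5)   # includes the two NA rows

# Alternative: drop missing values, then sort from longest to shortest
longest_five <- penguins |>
  filter(!is.na(bill_length_mm)) |>
  arrange(desc(bill_length_mm)) |>
  head(5)

longest_five
```

desc() reverses the sort order, so head() can be used for both the shortest and the longest bills.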
summarise() Function: Reduces multiple values down to a single summary.
summarise() Function
What is the species-wise mean bill length of penguins and the total number of penguins in each species?
# A tibble: 3 × 3
species `mean(bill_length_mm)` n
<fct> <dbl> <int>
1 Adelie 38.8 146
2 Chinstrap 48.8 68
3 Gentoo 47.6 119
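The counts above add up to 333, not 344, which suggests rows with missing values were removed first. A sketch under that assumption, with dplyr and palmerpenguins installed (na.omit() and the names species_summary and mean_bill_length_mm are choices made for this example):

```r
library(dplyr)
library(palmerpenguins)

species_summary <- penguins |>
  na.omit() |>                # assumption: keep only complete rows
  group_by(species) |>        # one group per species
  summarise(
    mean_bill_length_mm = mean(bill_length_mm),
    n = n()                   # number of penguins in each group
  )

species_summary
```

group_by() on its own changes nothing visible; it tells summarise() to compute one row per species instead of one row for the whole data set.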
Title slide background image is from Joanna Kosinska.
R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. ebook link
R bloggers https://www.r-bloggers.com/
The R Project for Statistical Computing https://www.r-project.org/
posit (earlier RStudio) https://posit.co/
R packages for data science https://www.tidyverse.org/
Thank You
Social Media #RStats
Artwork by Allison Horst