I’ve been trying to learn Julia (1.4) by doing something I commonly do with R or Python:
1. Have a bunch of CSVs, each with a year (YYYY) in the filename. Some datasets have columns that aren’t in the other datasets.
2. Read each CSV to build a list of DataFrames.
3. Combine all DataFrames into a single DataFrame.
   a. Make a Year column that contains the year associated with each original CSV.
I can’t seem to get 3a right in Julia. Here’s how I’d do it in R (Tidyverse):
library(dplyr)
library(purrr)
library(readr)
library(stringr)
# Assume there are only CSVs in the current working directory
files = list.files(path = ".", full.names = TRUE)
years = str_extract(files, "\\d{4}")
# Basically creates a dictionary, where each key is the year and
# the value is the filename
names(files) = years
# .id="Year" will create Year column from the keys. So rows from
# each DataFrame will have a Year value equal to the year from
# that CSV's filename.
#
# purrr::map_dfr allows column names to differ across DataFrames.
df = map_dfr(files, read_csv, .id = "Year")
Below is what I’ve tried in Julia. I’m missing a way to create a Year column in the final DataFrame, containing the year associated with each original CSV (task 3a above).
using CSV, DataFrames
# Assume there are only CSVs in the current working directory
files = readdir()
years = map(
m -> String(m.match),
match.(r"\d{4}", files),
)
df = mapreduce(
x -> CSV.File(x) |> DataFrame,
# Need cols=:union since columns aren't exactly the same in all
# DataFrames
(x, y) -> vcat(x, y, cols = :union),
files,
)
Any tips?
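One possible direction for 3a, as a rough sketch rather than a definitive answer: attach the Year column to each DataFrame right after reading it, before anything is concatenated, so the year travels with the rows. The helper names `read_with_year` and `read_all_with_year` are hypothetical (they are not part of CSV.jl or DataFrames.jl), and the sketch assumes every filename contains exactly one four-digit run that is the year.

```julia
using CSV, DataFrames

# Hypothetical helper: read one CSV and tag every row with the four-digit
# year found in its filename. Only the basename is searched, so digits in
# the directory part of the path are ignored.
function read_with_year(path)
    m = match(r"\d{4}", basename(path))
    m === nothing && error("no four-digit year in filename: $path")
    df = CSV.File(path) |> DataFrame
    # Stored as a String, like the R version; use parse(Int, m.match)
    # instead if an integer column is preferred.
    df[!, :Year] = fill(String(m.match), nrow(df))
    return df
end

# Hypothetical helper: read every .csv file in `dir` and stack the results.
function read_all_with_year(dir = ".")
    files = filter(f -> endswith(f, ".csv"), readdir(dir))
    paths = joinpath.(dir, files)
    return mapreduce(
        read_with_year,
        # cols = :union fills columns missing from one file with `missing`
        (x, y) -> vcat(x, y, cols = :union),
        paths,
    )
end
```

Calling `df = read_all_with_year()` in the directory holding the CSVs should then give one DataFrame with a Year column. Because the column is added per file, the separate `years` vector is no longer needed.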