Channel: First steps - JuliaLang

Read CSVs. Build DataFrame with column identifying the original CSV

I’ve been trying to learn Julia (1.4) by doing something I commonly do with R or Python:

  1. Have a bunch of CSVs, each with a year (YYYY) in the filename. Some datasets have columns that aren’t in the other datasets.
  2. Read each CSV to build a list of DataFrames
  3. Combine all DataFrames into a single DataFrame
    a. Make a Year column that contains the year associated with each original CSV.

I can’t seem to get 3a right in Julia. Here’s how I’d do it in R (Tidyverse):

library(dplyr)
library(purrr)
library(readr)
library(stringr)

# Assume there are only CSVs in the current working directory
files = list.files(path = ".", full.names = TRUE)

years = str_extract(files, "\\d{4}")
# Name each file path by its year. The result is a named character
# vector, which acts much like a dictionary from year to filename.
names(files) = years

# .id="Year" will create Year column from the keys. So rows from
# each DataFrame will have a Year value equal to the year from
# that CSV's filename.
#
# purrr::map_dfr allows column names to differ across DataFrames.
df = map_dfr(files, read_csv, .id = "Year")

Below is what I’ve tried in Julia. I’m missing a way to create a Year column in the final DataFrame, containing the year associated with each original CSV (task 3a above).

using CSV, DataFrames

# Assume there are only CSVs in the current working directory
files = readdir()

years = map(
    m -> String(m.match),
    match.(r"\d{4}", files),
)

df = mapreduce(
    x -> CSV.File(x) |> DataFrame,
    # Need cols=:union since columns aren't exactly the same in all
    # DataFrames
    (x, y) -> vcat(x, y, cols = :union),
    files,
)

Any tips?
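
One possible direction for 3a, as a minimal sketch rather than a definitive answer: add the Year column to each DataFrame right after reading it, then combine. The helper names `read_with_year` and `read_csvs_with_year` are hypothetical (they are not part of CSV.jl or DataFrames.jl), and the sketch assumes every file in the directory is a CSV whose name contains a four-digit year.

```julia
using CSV, DataFrames

# Hypothetical helper: read one CSV and tag every row with the
# four-digit year found in its filename.
function read_with_year(path::AbstractString)
    m = match(r"\d{4}", basename(path))
    m === nothing && error("no four-digit year in filename: $path")
    df = DataFrame(CSV.File(path))
    # Broadcasting assignment creates the column and repeats the
    # year (kept as a String, like the R version) for every row.
    df[!, :Year] .= String(m.match)
    return df
end

# Hypothetical helper: read every file in `dir` and stack the results.
function read_csvs_with_year(dir::AbstractString = ".")
    files = joinpath.(dir, readdir(dir))
    # cols = :union fills columns absent from a given file with `missing`.
    return reduce(
        (x, y) -> vcat(x, y, cols = :union),
        map(read_with_year, files),
    )
end
```

Calling `df = read_csvs_with_year()` in the directory that holds the CSVs should then give one DataFrame with a Year value on every row. Tagging before combining means the year never has to be matched back to rows after the `vcat`.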


