Channel: First steps - JuliaLang

Directory Path -> URL Mining; output a Tree


@pontus wrote:

Asking for help with a particular task (showing an XML sitemap as an abstract tree), and also for direction on how to approach this topic more broadly. The general goal is to learn to work with graph structures, starting with a tree of parent/child relationships in a unidirected graph, progressing to other forms such as DAGs, and perhaps getting to the point of using ‘GraphModularDecomposition.jl’, which @StefanKarpinski developed and linked to in the thread “Develop simple open source graph visualization library”. For more inspiration, there is “The DAG of Julia packages” by @juliohm, visualized here: https://juliohm.github.io/dataviz/DAG-of-Julia-packages/

Have learned to use ‘walkdir’ to create a tree view of a local directory (awesome!).
Essentially want to do the same thing, but with web URIs.
To illustrate, desired output below was generated by building out the example structure in a local directory, and using the ‘fstree.jl’ example in the ‘AbstractTrees.jl’ package (https://github.com/Keno/AbstractTrees.jl/blob/master/examples/fstree.jl).

weburl_tree_example
└─ webaddress.tld
   ├─ category1
   │  ├─ page1
   │  │  ├─ file.csv
   │  │  ├─ file.txt
   │  │  └─ pdf_file.pdf
   │  └─ page2
   ├─ category2
   │  └─ page1
   └─ category3

Here is the code that generated that hierarchy output:

using AbstractTrees
import AbstractTrees: children, printnode

struct File
    path::String
end

children(f::File) = ()

struct Directory
    path::String
end

function children(d::Directory)
    contents = readdir(d.path)
    children = Vector{Union{Directory,File}}(undef,length(contents))
    for (i,c) in enumerate(contents)
        path = joinpath(d.path,c)
        children[i] = isdir(path) ? Directory(path) : File(path)
    end
    return children
end

printnode(io::IO, d::Directory) = print(io, basename(d.path))
printnode(io::IO, f::File) = print(io, basename(f.path))

#dirpath = realpath(joinpath(dirname(pathof(AbstractTrees)),".."))
#d = Directory(dirpath)
dirpath = pwd() # This assumes current working directory is aligned with tree to create.
d = Directory(dirpath)
print_tree(d)

Can’t attach an .xml file directly, so here is the example file as output from EzXML:
prettyprint(rootnode)

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.webaddress.tld/category1/</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category1/page1/</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category1/page1/file.csv</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category1/page1/file.txt</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category1/page1/pdf_file.pdf</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category1/page2/</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category2/</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category2/page1/</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category3/</loc>
  </url>
</urlset>

Could perhaps parse the XML file as plain text (found bash script examples online that do this), but want to stay within Julia and be able to work with XML files in general.

One way to load the xml using EzXML is:
doc = readxml("sitemap_example.xml")
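
With the document loaded this way, one possible way to pull out the URLs is to walk the child elements rather than clean up the raw text. This is only a sketch: `sitemap_locs` is a hypothetical helper name (not part of EzXML), and the sample is parsed from a string with `parsexml` so it runs without the file; `readxml("sitemap_example.xml")` should be usable in its place.

```julia
using EzXML

# Two-entry sample in the same shape as the sitemap shown above.
xml = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.webaddress.tld/category1/</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category1/page1/file.csv</loc>
  </url>
</urlset>
"""

# Hypothetical helper (not an EzXML function): collect the text of every <loc>.
function sitemap_locs(doc::EzXML.Document)
    locs = String[]
    for url in eachelement(root(doc))      # each <url> entry under <urlset>
        for child in eachelement(url)      # child elements of that entry
            if nodename(child) == "loc"
                push!(locs, strip(nodecontent(child)))
            end
        end
    end
    return locs
end

locs = sitemap_locs(parsexml(xml))
foreach(println, locs)
```

Matching on `nodename` sidesteps the sitemap namespace, which would otherwise have to be passed explicitly to an XPath query.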

Went with a ‘streaming’ approach (this should be wrapped in functions; a work in progress!):

using EzXML
reader = open(EzXML.StreamReader, "sitemap_example.xml") # see the Streaming API section of the EzXML manual: https://bicycle1885.github.io/EzXML.jl/stable/manual/
#urlset=Array{String,1} # setting type, so as to avoid expected error "MethodError: Cannot `convert` an object of type Array{Any,1} to an object of type DataFrame" when converting array to df for using CSV.write to save. 
@show reader.type # the initial state is READER_NONE; comment this line out once working
iterate(reader);  # advance the reader's state from READER_NONE to READER_ELEMENT
@show reader.type # show state is READER_ELEMENT; comment this line out once working
#@show reader.content # show the string of url's, comment this line out once working 
rawlist = reader.content;
close(reader)
#rawlist
#typeof(rawlist)
#strippedlist = strip(rawlist, "\n  \n    ") # MethodError: objects of type String are not callable
# so, try 'replace', but for multiple occurrences
# see: https://discourse.julialang.org/t/replacing-multiple-strings-errors/13654/9 for this method solved by @bkamins and @bennedich added the 'foldl' bit:
# reduce(replace, ["A"=>"a", "B"=>"b", "C"=>"c"], init="ABC")
# not sure, that form didn't work, this does:
replacedlist = (replace(rawlist, "\n  \n"=>""))
replacedlist = (replace(replacedlist, "    https://www."=>""))
urlset = split(replacedlist, "  \n")
replacedlist=nothing
rawlist=nothing
urlset
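
A possible alternative to the string clean-up above is to let the stream reader visit every node and keep only the text of `loc` elements. This is a sketch, not a tested recipe for real sitemaps: `stream_locs` is a hypothetical helper name, and the sample is written to a temporary file only so the example is self-contained.

```julia
using EzXML

# Hypothetical helper: stream through a sitemap file and collect <loc> text.
function stream_locs(path::AbstractString)
    locs = String[]
    reader = open(EzXML.StreamReader, path)
    try
        for typ in reader                  # advances the reader node by node
            if typ == EzXML.READER_ELEMENT && nodename(reader) == "loc"
                push!(locs, strip(nodecontent(reader)))
            end
        end
    finally
        close(reader)                      # release the reader even on error
    end
    return locs
end

# Write a small sample to a temporary file (stand-in for sitemap_example.xml).
path, io = mktemp()
write(io, """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.webaddress.tld/category1/</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category2/page1/</loc>
  </url>
</urlset>
""")
close(io)

locs = stream_locs(path)
foreach(println, locs)
```

Because each `loc` is read on its own, there is no whitespace between entries to strip and no need for `split`.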

This is the output, which seems to be going in the right direction:
9-element Array{SubString{String},1}:
 "webaddress.tld/category1/"
 "webaddress.tld/category1/page1/"
 "webaddress.tld/category1/page1/file.csv"
 "webaddress.tld/category1/page1/file.txt"
 "webaddress.tld/category1/page1/pdf_file.pdf"
 "webaddress.tld/category1/page2/"
 "webaddress.tld/category2/"
 "webaddress.tld/category2/page1/"
 "webaddress.tld/category3/"

I think using split creates an Array with SubString:
typeof(urlset)
Array{SubString{String},1}

Thought these would need to be broken apart, so:

for i in urlset
    line = split(i, "/")
    println(line)
end

Output:
SubString{String}["webaddress.tld", "category1", ""]
SubString{String}["webaddress.tld", "category1", "page1", ""]
SubString{String}["webaddress.tld", "category1", "page1", "file.csv"]
SubString{String}["webaddress.tld", "category1", "page1", "file.txt"]
SubString{String}["webaddress.tld", "category1", "page1", "pdf_file.pdf"]
SubString{String}["webaddress.tld", "category1", "page2", ""]
SubString{String}["webaddress.tld", "category2", ""]
SubString{String}["webaddress.tld", "category2", "page1", ""]
SubString{String}["webaddress.tld", "category3", ""]
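
The empty strings at the end of those lists come from the trailing "/" in the URLs. A small sketch of one way to avoid them, and to drop the scheme and "www." in the same step, is below; `url_segments` is a hypothetical helper name, and the regular expression assumes URLs that start with "http://" or "https://".

```julia
# Hypothetical helper: turn a URL into its host and path segments.
function url_segments(url::AbstractString)
    # Drop the scheme and a leading "www." if present.
    stripped = replace(url, r"^https?://(www\.)?" => "")
    # keepempty=false discards the "" produced by a trailing "/".
    return String.(split(stripped, "/"; keepempty=false))
end

println(url_segments("https://www.webaddress.tld/category1/page1/"))
println(url_segments("webaddress.tld/category1/page1/file.csv"))
```

Converting with `String.(...)` gives plain `String` values instead of `SubString`s, which may be easier to store in other structures later.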

Am reading the AbstractTrees source to understand how Parent/Child relationships are identified and parsed to create a list of the relationships. Also looking at source by @tkoolen (https://github.com/JuliaRobotics/RigidBodyDynamics.jl/blob/master/src/graphs/directed_graph.jl and https://github.com/JuliaRobotics/RigidBodyDynamics.jl/blob/master/test/test_graph.jl).
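
For what it is worth, one possible shape for the parent/child step is sketched below. It mirrors the `fstree.jl` pattern (define `children` and `printnode` for a custom type), but the node type `URLNode` and the helpers `insert_path!` and `parent_child_pairs` are hypothetical names made up for this sketch, not part of AbstractTrees. It assumes the URLs have already been reduced to the "webaddress.tld/..." form shown earlier.

```julia
using AbstractTrees

# Hypothetical node type: a name plus a list of child nodes.
struct URLNode
    name::String
    children::Vector{URLNode}
end
URLNode(name::AbstractString) = URLNode(String(name), URLNode[])

# Tell AbstractTrees how to walk and print the custom type.
AbstractTrees.children(n::URLNode) = n.children
AbstractTrees.printnode(io::IO, n::URLNode) = print(io, n.name)

# Insert one path, creating any intermediate nodes that do not exist yet.
function insert_path!(top::URLNode, segments)
    node = top
    for seg in segments
        idx = findfirst(c -> c.name == seg, node.children)
        if idx === nothing
            child = URLNode(seg)
            push!(node.children, child)
            node = child
        else
            node = node.children[idx]
        end
    end
    return top
end

# Collect parent => child pairs by recursive descent.
function parent_child_pairs(n::URLNode, acc = Pair{String,String}[])
    for c in n.children
        push!(acc, n.name => c.name)
        parent_child_pairs(c, acc)
    end
    return acc
end

urls = [
    "webaddress.tld/category1/",
    "webaddress.tld/category1/page1/",
    "webaddress.tld/category1/page1/file.csv",
    "webaddress.tld/category1/page1/file.txt",
    "webaddress.tld/category1/page1/pdf_file.pdf",
    "webaddress.tld/category1/page2/",
    "webaddress.tld/category2/",
    "webaddress.tld/category2/page1/",
    "webaddress.tld/category3/",
]

tree = URLNode("weburl_tree_example")
for url in urls
    insert_path!(tree, split(url, "/"; keepempty=false))
end

print_tree(tree)
foreach(println, parent_child_pairs(tree))
```

Children are kept in insertion order, so the printed hierarchy should follow the order of the sitemap. The linear `findfirst` lookup is fine for a small example; a `Dict` keyed by name would likely scale better for large sitemaps.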

  • Feel like this is going in the right direction, but it has taken a good while and am floundering at this point.
  • Any suggestions & advice welcome!

Posts: 1

Participants: 1
