JC Tips and Tricks - Part 2

URL Parsing | POSIX Path Parsing | Convert Git Logs

In my last JC Tips and Tricks post I discussed using jc as a subnet calculator, for exploring X.509 certs, and converting different types of timestamp strings.

In this post I’ll go over some more use-cases:

Parsing URLs
Parsing POSIX Paths
Exploring Git Logs

URL Parsing

URLs are notoriously difficult to parse. URLs and URIs can be percent-encoded and can include a surprising number of components, including authentication credentials, a path, a query that can assign multiple values to a single key, and more!

When you pipe a URL into jc with the --url parser, it will explode all of the URL parts into a nice, tidy JSON object. The output also includes an encoded and a decoded version of each part, so you can encode a URL or decode an already encoded one. You can then easily pull any value with a tool like jq or jello.

% echo 'https://www.example.com/this%20is%20a%20path/parent/form.htm?mykey=value1&mykey=value2' | jc --url --pretty
{
  "url": "https://www.example.com/this%20is%20a%20path/parent/form.htm?mykey=value1&mykey=value2",
  "scheme": "https",
  "netloc": "www.example.com",
  "path": "/this%20is%20a%20path/parent/form.htm",
  "parent": "/this%20is%20a%20path/parent",
  "filename": "form.htm",
  "stem": "form",
  "extension": "htm",
  "path_list": [
    "this%20is%20a%20path",
    "parent",
    "form.htm"
  ],
  "query": "mykey=value1&mykey=value2",
  "query_obj": {
    "mykey": [
      "value1",
      "value2"
    ]
  },
  "fragment": null,
  "username": null,
  "password": null,
  "hostname": "www.example.com",
  "port": null,
  "encoded": {
    "url": "https://www.example.com/this%2520is%2520a%2520path/parent/form.htm?mykey=value1&mykey=value2",
    "scheme": "https",
    "netloc": "www.example.com",
    "path": "/this%2520is%2520a%2520path/parent/form.htm",
    "parent": "/this%2520is%2520a%2520path/parent",
    "filename": "form.htm",
    "stem": "form",
    "extension": "htm",
    "path_list": [
      "this%2520is%2520a%2520path",
      "parent",
      "form.htm"
    ],
    "query": "mykey=value1&mykey=value2",
    "fragment": null,
    "username": null,
    "password": null,
    "hostname": "www.example.com",
    "port": null
  },
  "decoded": {
    "url": "https://www.example.com/this is a path/parent/form.htm?mykey=value1&mykey=value2",
    "scheme": "https",
    "netloc": "www.example.com",
    "path": "/this is a path/parent/form.htm",
    "parent": "/this is a path/parent",
    "filename": "form.htm",
    "stem": "form",
    "extension": "htm",
    "path_list": [
      "this is a path",
      "parent",
      "form.htm"
    ],
    "query": "mykey=value1&mykey=value2",
    "fragment": null,
    "username": null,
    "password": null,
    "hostname": "www.example.com",
    "port": null
  }
}

Notice that you get the original, encoded, and decoded URL information. In the example above, the path includes encoded spaces which we can decode as follows:

% echo 'https://www.example.com/this%20is%20a%20path/parent/form.htm?mykey=value1&mykey=value2' | jc --url | jq -r '.decoded.path'
/this is a path/parent/form.htm
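
The encoded section of the full output may look odd at first: because the input was already percent-encoded, encoding it again escapes each % as %25, which is why %20 shows up there as %2520. Here is a minimal Python sketch of the same round trip using only the standard library's urllib.parse. It illustrates the behavior and is not necessarily how jc implements it.

```python
from urllib.parse import quote, unquote

# Path taken from the example URL above; it is already percent-encoded.
original_path = "/this%20is%20a%20path/parent/form.htm"

# Decoding turns each %20 back into a literal space.
decoded_path = unquote(original_path)

# Encoding an already encoded path escapes the "%" itself as "%25".
double_encoded_path = quote(original_path)

print(decoded_path)
print(double_encoded_path)
```

In practice this means you would read .decoded when the input is already encoded and .encoded when it is not.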

The URL above also includes a query – but not just any query. It is perfectly legal to include the same key-name multiple times, which assigns multiple values to a single key. This is handled correctly by jc:

% echo 'https://www.example.com/this%20is%20a%20path/parent/form.htm?mykey=value1&mykey=value2' | jc --url | jq '.query_obj' 
{
  "mykey": [
    "value1",
    "value2"
  ]
}

This allows you to grab one or more values from the above:

% echo 'https://www.example.com/this%20is%20a%20path/parent/form.htm?mykey=value1&mykey=value2' | jc --url | jq -r '.query_obj.mykey[1]'
value2
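
If you are curious how repeated keys end up as a list, Python's standard library shows the same behavior. This is a rough sketch with urllib.parse, offered as an illustration only; I am not claiming it is what jc does internally.

```python
from urllib.parse import parse_qs, urlsplit

url = (
    "https://www.example.com/this%20is%20a%20path/parent/form.htm"
    "?mykey=value1&mykey=value2"
)

# parse_qs maps every key to a list, so repeated keys keep all of their values.
query_obj = parse_qs(urlsplit(url).query)

# Same lookup as the jq filter '.query_obj.mykey[1]'.
second_value = query_obj["mykey"][1]

print(query_obj)
print(second_value)
```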

You can also encode a URL so it can be safely used on the web:

% echo 'https://www.example.com/this path has/spaces in it/' | jc --url | jq -r '.encoded.url'
https://www.example.com/this%20path%20has/spaces%20in%20it/
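
For comparison, here is roughly what encoding only the path component looks like in plain Python. The helper name encode_url_path is hypothetical and exists only for this sketch; it assumes the scheme, host, and query should be left untouched.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url_path(url: str) -> str:
    """Percent-encode the path of a URL and leave the other parts as-is.

    Hypothetical helper for illustration; not part of jc.
    """
    parts = urlsplit(url)
    # quote() keeps "/" by default, so the path separators survive.
    encoded = parts._replace(path=quote(parts.path))
    return urlunsplit(encoded)


encoded_url = encode_url_path("https://www.example.com/this path has/spaces in it/")
print(encoded_url)
```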

The URL parser correctly handles authentication, IPv6, and port information as well:

% echo 'https://myusername:mypassword@[1:2::127]:8000/index.htm' | jc --url --pretty
{
  "url": "https://myusername:mypassword@[1:2::127]:8000/index.htm",
  "scheme": "https",
  "netloc": "myusername:mypassword@[1:2::127]:8000",
  "path": "/index.htm",
  "parent": "/",
  "filename": "index.htm",
  "stem": "index",
  "extension": "htm",
  "path_list": [
    "index.htm"
  ],
  "query": null,
  "query_obj": null,
  "fragment": null,
  "username": "myusername",
  "password": "mypassword",
  "hostname": "1:2::127",
  "port": 8000,
<snip>
}
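
The same credentials, bracketed IPv6 host, and port can be pulled apart with Python's urllib.parse, which is a handy way to sanity-check the fields above. Treat this as a sketch for comparison rather than a description of the jc parser itself.

```python
from urllib.parse import urlsplit

parts = urlsplit("https://myusername:mypassword@[1:2::127]:8000/index.htm")

# The convenience attributes split the netloc into its pieces.
netloc_fields = {
    "netloc": parts.netloc,
    "username": parts.username,
    "password": parts.password,
    "hostname": parts.hostname,  # brackets around the IPv6 address are removed
    "port": parts.port,          # returned as an integer
}

print(netloc_fields)
```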

Now you don’t have to worry about writing a complex regex to correctly handle strange corner cases around URL parsing in your scripts!

POSIX Path Parsing

Graphic from https://miguendes.me/python-pathlib

Sometimes you need to parse POSIX-compliant paths in your scripts or even a list of paths from the $PATH environment variable. This can be a pain because there are a number of edge cases, including the fact that paths and filenames can include spaces. The --path parser in jc will deconstruct the path into all of its parts:

% echo '/home/joeuser/git/my project/app.py' | jc --path --pretty 
{
  "path": "/home/joeuser/git/my project/app.py",
  "parent": "/home/joeuser/git/my project",
  "filename": "app.py",
  "stem": "app",
  "extension": "py",
  "path_list": [
    "/",
    "home",
    "joeuser",
    "git",
    "my project",
    "app.py"
  ]
}

The path is broken down into its various parts, including the parent, filename, stem (filename without the extension), and extension. A path_list field is also included that gives you an array of all of the path parts so you can easily pull a value:

% echo '/home/joeuser/git/my project/app.py' | jc --path | jq -r '.path_list[-2]'
my project
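
The field names map closely onto Python's pathlib. Below is a minimal sketch that rebuilds the same object with PurePosixPath. The dictionary layout mirrors the output above, and stripping the leading dot from the suffix is my assumption about how the extension field is derived.

```python
from pathlib import PurePosixPath

p = PurePosixPath("/home/joeuser/git/my project/app.py")

path_fields = {
    "path": str(p),
    "parent": str(p.parent),
    "filename": p.name,
    "stem": p.stem,
    # pathlib's suffix includes the dot (".py"); drop it to match the output above.
    "extension": p.suffix.lstrip("."),
    "path_list": list(p.parts),
}

# Same lookup as the jq filter '.path_list[-2]'.
print(path_fields["path_list"][-2])
```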

The --path parser works with Windows paths, too:

% echo 'C:\Windows\Program Files\xfolder\file.txt' | jc --path --pretty
{
  "path": "C:\\Windows\\Program Files\\xfolder\\file.txt",
  "parent": "C:\\Windows\\Program Files\\xfolder",
  "filename": "file.txt",
  "stem": "file",
  "extension": "txt",
  "path_list": [
    "C:\\",
    "Windows",
    "Program Files",
    "xfolder",
    "file.txt"
  ]
}
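
Python's PureWindowsPath understands drive letters and backslashes on any operating system, which makes it a convenient way to reproduce the breakdown above. This sketch picks the Windows flavor explicitly; how jc decides that a path is a Windows path is not covered here.

```python
from pathlib import PureWindowsPath

# A raw string keeps the backslashes literal.
win = PureWindowsPath(r"C:\Windows\Program Files\xfolder\file.txt")

windows_fields = {
    "parent": str(win.parent),
    "filename": win.name,
    "stem": win.stem,
    "extension": win.suffix.lstrip("."),
    "path_list": list(win.parts),  # the first element is the drive plus root
}

print(windows_fields["path_list"])
```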

In addition, jc can parse the path list from the $PATH environment variable with the --path-list parser. You get all of the same information above but with each path added to an array:

% echo $PATH
/Users/joeuser/.pyenv/shims:/Users/joeuser/.gem/ruby/2.6.0/bin:/Users/joeuser/.local/bin:/Users/joeuser/.cargo/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin

% echo $PATH | jc --path-list --pretty
[
  {
    "path": "/Users/joeuser/.pyenv/shims",
    "parent": "/Users/joeuser/.pyenv",
    "filename": "shims",
    "stem": "shims",
    "extension": "",
    "path_list": [
      "/",
      "Users",
      "joeuser",
      ".pyenv",
      "shims"
    ]
  },
  {
    "path": "/Users/joeuser/.gem/ruby/2.6.0/bin",
    "parent": "/Users/joeuser/.gem/ruby/2.6.0",
    "filename": "bin",
    "stem": "bin",
    "extension": "",
    "path_list": [
      "/",
      "Users",
      "joeuser",
      ".gem",
      "ruby",
      "2.6.0",
      "bin"
    ]
  },
<snip>
]
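
Conceptually, a path list is just the single-path breakdown applied to each colon-separated entry. Here is a small sketch of that idea in Python. The describe helper is hypothetical, the sample value is a shortened copy of the $PATH shown above, and skipping empty entries is an assumption of this sketch.

```python
from pathlib import PurePosixPath

# Shortened copy of the $PATH value from the example above.
path_var = "/Users/joeuser/.pyenv/shims:/Users/joeuser/.gem/ruby/2.6.0/bin:/usr/local/bin"


def describe(path_text: str) -> dict:
    """Break one POSIX path into the fields used in this post (hypothetical helper)."""
    p = PurePosixPath(path_text)
    return {
        "path": str(p),
        "parent": str(p.parent),
        "filename": p.name,
        "stem": p.stem,
        "extension": p.suffix.lstrip("."),
        "path_list": list(p.parts),
    }


# Split on ":" and skip empty entries (an assumption of this sketch).
path_entries = [describe(entry) for entry in path_var.split(":") if entry]

for entry in path_entries:
    print(entry["path"], "->", entry["parent"])
```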

Finally, you can use the --slurp option to parse a line-delimited list of paths with the --path parser:

% cat paths.txt 
/usr/local/bin
/home/joeuser
/var/log/system

% cat paths.txt | jc --path --slurp --pretty
[
  {
    "path": "/usr/local/bin",
    "parent": "/usr/local",
    "filename": "bin",
    "stem": "bin",
    "extension": "",
    "path_list": [
      "/",
      "usr",
      "local",
      "bin"
    ]
  },
  {
    "path": "/home/joeuser",
    "parent": "/home",
    "filename": "joeuser",
    "stem": "joeuser",
    "extension": "",
    "path_list": [
      "/",
      "home",
      "joeuser"
    ]
  },
  {
    "path": "/var/log/system",
    "parent": "/var/log",
    "filename": "system",
    "stem": "system",
    "extension": "",
    "path_list": [
      "/",
      "var",
      "log",
      "system"
    ]
  }
]

As you can see, parsing paths with jc gives you more control than just using basename or dirname in your scripts.

Exploring Git Log Output

One of my most popular blog posts is about converting Git log output to JSON. Converting the log to JSON provides an easier way of exploring git log output in scripts or saving the output as a structured object for later querying.

% git log --format=fuller --stat | jc --git-log --pretty
[
  {
    "commit": "4bed8392b83bc5ebc55238ab516a19cfafba4bfa",
    "author": "Kelly Brazil",
    "author_email": "kellyjonbrazil@gmail.com",
    "date": "Tue Feb 20 08:56:53 2024 -0800",
    "commit_by": "Kelly Brazil",
    "commit_by_email": "kellyjonbrazil@gmail.com",
    "commit_by_date": "Tue Feb 20 08:56:53 2024 -0800",
    "stats": {
      "files_changed": 3,
      "insertions": 3,
      "deletions": 3,
      "files": [
        "README.md",
        "man/jc.1",
        "templates/readme_template"
      ]
    },
    "message": "update release notes section",
    "epoch": 1708448213,
    "epoch_utc": null
  },
  {
    "commit": "8e2bcba35230079c8f8c3e741f840a4e68e354af",
    "author": "Kelly Brazil",
    "author_email": "kellyjonbrazil@gmail.com",
    "date": "Wed Feb 14 15:45:18 2024 -0800",
    "commit_by": "Kelly Brazil",
    "commit_by_email": "kellyjonbrazil@gmail.com",
    "commit_by_date": "Wed Feb 14 15:45:18 2024 -0800",
    "stats": {
      "files_changed": 2,
      "insertions": 4,
      "deletions": 4,
      "files": [
        "docs/parsers/proc.md",
        "jc/parsers/proc.py"
      ]
    },
    "message": "use get_parser instead of importlib",
    "epoch": 1707954318,
    "epoch_utc": null
  },
<snip>
]

Now you can easily query the data. For example, this is how you can get a list of commits that have more than 20 files changed:

% git log --format=fuller --stat | jc --git-log | jq '.[] | select(.stats.files_changed > 20)'
{
  "commit": "c332c4febf2bf757f662bc2cb22ec20fe0b4bcd0",
  "author": "Muescha",
  "author_email": "184316+muescha@users.noreply.github.com",
  "date": "Wed Jan 31 05:04:55 2024 +0100",
  "commit_by": "Kelly Brazil",
  "commit_by_email": "kellyjonbrazil@gmail.com",
  "commit_by_date": "Tue Feb 6 01:54:31 2024 +0000",
  "stats": {
    "files_changed": 28,
    "insertions": 1041,
    "deletions": 1,
    "files": [
      "CHANGELOG",
      "jc/lib.py",
      "jc/parsers/path.py",
      "jc/parsers/path_list.py",
      "tests/fixtures/generic/path--long.json",
      "tests/fixtures/generic/path--long.out",
      "tests/fixtures/generic/path--one.json",
      "tests/fixtures/generic/path--one.out",
      "tests/fixtures/generic/path--windows.json",
      "tests/fixtures/generic/path--windows.out",
      "tests/fixtures/generic/path--with-spaces.json",
      "tests/fixtures/generic/path--with-spaces.out",
      "tests/fixtures/generic/path_list--long.json",
      "tests/fixtures/generic/path_list--long.out",
      "tests/fixtures/generic/path_list--one.json",
      "tests/fixtures/generic/path_list--one.out",
      "tests/fixtures/generic/path_list--two.json",
      "tests/fixtures/generic/path_list--two.out",
      ".../generic/path_list--windows-environment.json",
      ".../generic/path_list--windows-environment.out",
      ".../fixtures/generic/path_list--windows-long.json",
      "tests/fixtures/generic/path_list--windows-long.out",
      "tests/fixtures/generic/path_list--windows.json",
      "tests/fixtures/generic/path_list--windows.out",
      "tests/fixtures/generic/path_list--with-spaces.json",
      "tests/fixtures/generic/path_list--with-spaces.out",
      "tests/test_path.py",
      "tests/test_path_list.py"
    ]
  },
  "message": "draft for path and path_list (#513)\n\n* draft for path_list\n\n* updaate doc\n\n* add input check\n\n* fix types\n\n* fix schema: add missing properties\n\n* add _process\n\n* fix _process docs\n\n* refactor: extract path.py parser\n\n* swap order of names alphabetically\n\n* documentation and comments\n\n* path parser: add early return for nodata\n\n* path and path-list parser: add test and fixtures\n\n* typo in file name\n\n* add early return for nodata\n\n* add test and fixtures\n\n* typo in file name\n\n* rename fixtures\n\n* rename fixtures\n\n* refactor to pathlib.Path\n\n* failing on windows - use PurePosixPath\n\n* changed the way to strip dot from suffix\n\n* add POSIX to path\n\n* test commit to see results on windows is failing\n\n* test commit to see results on windows is failing\n\n* add windows path detection\n\n* somehow Path not like the newline from input line\n\n* add test with more items\n\n* remove debug print\n\n* wrap test loops into into subTest\n\n* remove print statements\n\n* add path and path-list to CHANGELOG\n\n---------\n\nCo-authored-by: Kelly Brazil <kellyjonbrazil@gmail.com>",
  "epoch": 1706706295,
  "epoch_utc": null
}
{
  "commit": "2d5d87c73db538acdb0bcc58b8f250febe0ecbab",
  "author": "Kelly Brazil",
  "author_email": "kellyjonbrazil@gmail.com",
  "date": "Wed Jan 3 15:57:08 2024 -0800",
  "commit_by": "Kelly Brazil",
  "commit_by_email": "kellyjonbrazil@gmail.com",
  "commit_by_date": "Tue Feb 6 01:54:31 2024 +0000",
  "stats": {
    "files_changed": 21,
    "insertions": 76,
    "deletions": 21,
    "files": [
      "CHANGELOG",
      "README.md",
      "completions/jc_bash_completion.sh",
      "completions/jc_zsh_completion.sh",
      "docgen.sh",
      "docs/lib.md",
      "docs/parsers/date.md",
      "docs/parsers/datetime_iso.md",
      "docs/parsers/email_address.md",
      "docs/parsers/ip_address.md",
      "docs/parsers/jwt.md",
      "docs/parsers/semver.md",
      "docs/parsers/timestamp.md",
      "docs/parsers/url.md",
      "docs/parsers/ver.md",
      "jc/cli.py",
      "jc/lib.py",
      "man/jc.1",
      "setup.py",
      "templates/manpage_template",
      "templates/readme_template"
    ]
  },
  "message": "version bump and doc update",
  "epoch": 1704326228,
  "epoch_utc": null
}
<snip>
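
Because the output is plain JSON, the same query works in any language. Here is a minimal Python sketch of the "more than 20 files changed" filter. The records are trimmed copies of commits shown in this post, and the .get() fallbacks assume that some commits may not carry a stats object.

```python
import json

# Trimmed copies of records from the output above (most fields omitted).
git_log_json = """
[
  {"commit": "4bed8392b83bc5ebc55238ab516a19cfafba4bfa", "stats": {"files_changed": 3}},
  {"commit": "c332c4febf2bf757f662bc2cb22ec20fe0b4bcd0", "stats": {"files_changed": 28}},
  {"commit": "2d5d87c73db538acdb0bcc58b8f250febe0ecbab", "stats": {"files_changed": 21}}
]
"""

commits = json.loads(git_log_json)

# Keep the hashes of commits that changed more than 20 files.
big_commits = [
    c["commit"]
    for c in commits
    if c.get("stats", {}).get("files_changed", 0) > 20
]

print(big_commits)
```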

Or you can see all of the commits that updated the CHANGELOG file:

% git log --format=fuller --stat | jc --git-log | jq '.[] | select(.stats.files[]? | contains("CHANGELOG"))'
{
  "commit": "25085c3412eb1bf8e9c31ea9ae788c671e409585",
  "author": "Kelly Brazil",
  "author_email": "kellyjonbrazil@gmail.com",
  "date": "Wed Feb 14 15:25:22 2024 -0800",
  "commit_by": "Kelly Brazil",
  "commit_by_email": "kellyjonbrazil@gmail.com",
  "commit_by_date": "Wed Feb 14 15:25:22 2024 -0800",
  "stats": {
    "files_changed": 4,
    "insertions": 5,
    "deletions": 4,
    "files": [
      "CHANGELOG",
      "docs/parsers/iwconfig.md",
      "jc/parsers/iwconfig.py",
      "man/jc.1"
    ]
  },
  "message": "doc update",
  "epoch": 1707953122,
  "epoch_utc": null
}
{
  "commit": "6275591ef1a02612feb06ce4819809a4bf118ab1",
  "author": "Kelly Brazil",
  "author_email": "kellyjonbrazil@gmail.com",
  "date": "Mon Feb 12 21:31:10 2024 -0800",
  "commit_by": "Kelly Brazil",
  "commit_by_email": "kellyjonbrazil@gmail.com",
  "commit_by_date": "Mon Feb 12 21:31:10 2024 -0800",
  "stats": {
    "files_changed": 7,
    "insertions": 12,
    "deletions": 9,
    "files": [
      "CHANGELOG",
      "docs/parsers/proc.md",
      "jc/lib.py",
      "jc/parsers/proc.py",
      "man/jc.1",
      "setup.py",
      "templates/manpage_template"
    ]
  },
  "message": "version bump and doc updates",
  "epoch": 1707802270,
  "epoch_utc": null
}
<snip>
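
One detail worth knowing: the jq filter above emits a commit once per matching file name, so a commit that touched two files with CHANGELOG in their names would be printed twice. The Python sketch below uses any() so each commit appears at most once. The touches helper is hypothetical, and the records are trimmed copies of commits from this post.

```python
import json

# Trimmed copies of records from the output above (most fields omitted).
git_log_json = """
[
  {"commit": "25085c3412eb1bf8e9c31ea9ae788c671e409585",
   "stats": {"files": ["CHANGELOG", "docs/parsers/iwconfig.md", "jc/parsers/iwconfig.py", "man/jc.1"]}},
  {"commit": "8e2bcba35230079c8f8c3e741f840a4e68e354af",
   "stats": {"files": ["docs/parsers/proc.md", "jc/parsers/proc.py"]}},
  {"commit": "6275591ef1a02612feb06ce4819809a4bf118ab1",
   "stats": {"files": ["CHANGELOG", "docs/parsers/proc.md", "jc/lib.py"]}}
]
"""


def touches(commit: dict, needle: str) -> bool:
    """Return True if any changed file name contains needle (hypothetical helper)."""
    files = commit.get("stats", {}).get("files", [])
    return any(needle in name for name in files)


commits = json.loads(git_log_json)
changelog_commits = [c["commit"] for c in commits if touches(c, "CHANGELOG")]

print(changelog_commits)
```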

This is much more convenient than using grep or awk and finding the correct git format string.

That’s it for this post – more to come! Let me know if you have a favorite jc use case you would like covered in the future.

Published by kellyjonbrazil