[2/2] post_pw: Store submitted checks locally as well

Message ID 20240122172635.3641078-3-aconole@redhat.com (mailing list archive)
State New
Headers
Series Reduced checks API usage |

Commit Message

Aaron Conole Jan. 22, 2024, 5:26 p.m. UTC
  Jeremy Kerr reports that our PW checks reporting submitted 43000 API calls
in just a single day.  That is alarmingly unacceptable.  We can store the
URLs we've already submitted and then just skip over any additional
processing at least on the PW side.

This patch does two things to try and mitigate this issue:

1. Store each patch ID and URL in the series DB to show that we reported
   the check.  This means we don't need to poll patchwork for check status

2. Store the last modified time of the reports mailing list.  This means
   we only poll the mailing list when a new email has surely landed.

Signed-off-by: Aaron Conole <aconole@redhat.com>
---
 post_pw.sh       | 35 ++++++++++++++++++++++++++++++++++-
 series_db_lib.sh | 25 +++++++++++++++++++++++++
 2 files changed, 59 insertions(+), 1 deletion(-)
  

Comments

Michael Santana Jan. 22, 2024, 6:16 p.m. UTC | #1
On Mon, Jan 22, 2024 at 12:26 PM Aaron Conole <aconole@redhat.com> wrote:
>
> Jeremy Kerr reports that our PW checks reporting submitted 43000 API calls
> in just a single day.  That is alarmingly unacceptable.  We can store the
> URLs we've already submitted and then just skip over any additional
> processing at least on the PW side.
>
> This patch does two things to try and mitigate this issue:
>
> 1. Store each patch ID and URL in the series DB to show that we reported
>    the check.  This means we don't need to poll patchwork for check status
Yeah, we should have done that from the start. I should have thought
about this sooner.
>
> 2. Store the last modified time of the reports mailing list.  This means
>    we only poll the mailing list when a new email has surely landed.
>
> Signed-off-by: Aaron Conole <aconole@redhat.com>
> ---
>  post_pw.sh       | 35 ++++++++++++++++++++++++++++++++++-
>  series_db_lib.sh | 25 +++++++++++++++++++++++++
>  2 files changed, 59 insertions(+), 1 deletion(-)
>
> diff --git a/post_pw.sh b/post_pw.sh
> index fe2f41c..3e3a493 100755
> --- a/post_pw.sh
> +++ b/post_pw.sh
> @@ -20,6 +20,7 @@
>  # License for the specific language governing permissions and limitations
>  # under the License.
>
> +[ -f "$(dirname $0)/series_db_lib.sh" ] && source "$(dirname $0)/series_db_lib.sh" || exit 1
>  [ -f "${HOME}/.mail_patchwork_sync.rc" ] && source "${HOME}/.mail_patchwork_sync.rc"
>
>  # Patchwork instance to update with new reports from mailing list
> @@ -75,6 +76,13 @@ send_post() {
>      if [ -z "$context" -o -z "$state" -o -z "$description" -o -z "$patch_id" ]; then
>          echo "Skpping \"$link\" due to missing context, state, description," \
>               "or patch_id" 1>&2
> +        # Just don't want to even bother seeing these "bad" patches as well.
> +        add_check_scanned_url "$patch_id" "$target_url"
> +        return 0
> +    fi
> +
> +    if check_id_exists "$patch_id" "$target_url" ; then
> +        echo "Skipping \"$link\" - already reported." 1>&2
>          return 0
>      fi
>
> @@ -84,6 +92,7 @@ send_post() {
>          "$api_url")"
>      if [ $? -ne 0 ]; then
>          echo "Failed to get proper server response on link ${api_url}" 1>&2
> +        # Don't store these as processed in case the server has a temporary issue.
>          return 0
>      fi
>
> @@ -95,6 +104,9 @@ send_post() {
>      jq -e "[.[].target_url] | contains([\"$mail_url\"])" >/dev/null
>      then
>          echo "Report ${target_url} already pushed to patchwork. Skipping." 1>&2
> +        # Somehow this was not stored (for example, first time we apply the tracking
> +        # feature).  Store it now.
> +        add_check_scanned_url "$patch_id" "$target_url"
>          return 0
>      fi
>
> @@ -114,12 +126,31 @@ send_post() {
>      if [ $? -ne 0 ]; then
>          echo -e "Failed to push retults based on report ${link} to the"\
>                  "patchwork instance ${pw_instance} using the following REST"\
> -                "API Endpoint ${api_url} with the following data:\n$data\n"
> +                "API Endpoint ${api_url} with the following data:\n$data\n" 1>&2
>          return 0
>      fi
> +
> +    add_check_scanned_url "$patch_id" "$target_url"
>  }
>
> +# Collect the date.  NOTE: this needs some accomodate to catch the month change-overs
>  year_month="$(date +"%Y-%B")"
> +
> +# Get the last modified time
> +report_last_mod=$(curl -A "pw-post" -sSf "${mail_archive}${year_month}/thread.html" | grep Last-Modified)
please use grep -i. I doubt Last-Modified will ever change. But better
be on the safe side
> +
> +mailing_list_save_file=$(echo ".post_pw_${mail_archive}${year_month}" | sed -e "s@/@_@g" -e "s@:@_@g" -e "s,@,_,g")
> +
> +if [ -e "${HOME}/${mailing_list_save_file}" ]; then
> +    last_read_date=$(cat "${HOME}/${mailing_list_save_file}")
wait.... what? Please correct me if I am misunderstanding this

But I dont think that $last_read_date or $mailing_list_save_file ever
get populated. They are always empty strings/file.

I think you are missing a last_read_date="$report_last_mod" somewhere
in this code. It might not be in the place that I mentioned below, but
I am fairly certain you are missing it somewhere
> +    if [ "$last_read_date" == "$report_last_mod" ]; then
> +        echo "Last modified times match.  Skipping list parsing."
> +        exit 0
> +    fi
Maybe if we change this if to an else like this

else
    "$last_read_date"="$report_last_mod"
fi

> +else
> +    touch "${HOME}/${mailing_list_save_file}"
> +fi
> +
>  reports="$(curl -A "pw-post" -sSf "${mail_archive}${year_month}/thread.html" | \
>           grep -i 'HREF=' | sed -e 's@[0-9]*<LI><A HREF="@\|@' -e 's@">@\|@')"
>  if [ $? -ne 0 ]; then
> @@ -132,3 +163,5 @@ echo "$reports" | while IFS='|' read -r blank link title; do
>          send_post "${mail_archive}${year_month}/$link"
>      fi
>  done
> +
> +echo "$last_read_date" > "${HOME}/${mailing_list_save_file}"
> diff --git a/series_db_lib.sh b/series_db_lib.sh
> index c5f42e0..0635469 100644
> --- a/series_db_lib.sh
> +++ b/series_db_lib.sh
> @@ -130,6 +130,17 @@ recheck_sync INTEGER
>  EOF
>          run_db_command "INSERT INTO series_schema_version(id) values (8);"
>      fi
> +
> +    run_db_command "select * from series_schema_version;" | egrep '^9$' > /dev/null 2>&1
> +    if [ $? -eq 1 ]; then
> +        sqlite3 ${HOME}/.series-db <<EOF
> +CREATE TABLE check_id_scanned (
> +check_patch_id INTEGER,
> +check_url STRING
> +)
> +EOF
> +        run_db_command "INSERT INTO series_schema_version(id) values (9);"
> +    fi
>  }
>
>  function series_db_exists() {
> @@ -468,3 +479,17 @@ function set_recheck_request_state() {
>
>      echo "UPDATE recheck_requests set recheck_sync=$recheck_state where patchwork_instance=\"$recheck_instance\" and patchwork_project=\"$recheck_project\" and recheck_requested_by=\"$recheck_requested_by\" and recheck_series=\"$recheck_series\";" | series_db_execute
>  }
> +
> +function add_check_scanned_url() {
> +    local patch_id="$1"
> +    local url="$2"
> +
> +    echo "INSERT into check_id_scanned (check_patch_id, check_url) values (${patch_id}, \"$url\");" | series_db_execute
> +}
> +
> +function check_id_exists() {
> +    local patch_id="$1"
> +    local url="$2"
> +
> +    echo "select * from check_id_scanned where check_patch_id=$patch_id and check_url=\"$url\";" | series_db_execute | grep "$url" >/dev/null 2>&1
> +}
> --
> 2.41.0
>
  
Aaron Conole Jan. 22, 2024, 7:31 p.m. UTC | #2
Michael Santana <msantana@redhat.com> writes:

> On Mon, Jan 22, 2024 at 12:26 PM Aaron Conole <aconole@redhat.com> wrote:
>>
>> Jeremy Kerr reports that our PW checks reporting submitted 43000 API calls
>> in just a single day.  That is alarmingly unacceptable.  We can store the
>> URLs we've already submitted and then just skip over any additional
>> processing at least on the PW side.
>>
>> This patch does two things to try and mitigate this issue:
>>
>> 1. Store each patch ID and URL in the series DB to show that we reported
>>    the check.  This means we don't need to poll patchwork for check status
> Yeah, we should have done that from the start. I should have thought
> about this sooner.
>>
>> 2. Store the last modified time of the reports mailing list.  This means
>>    we only poll the mailing list when a new email has surely landed.
>>
>> Signed-off-by: Aaron Conole <aconole@redhat.com>
>> ---
>>  post_pw.sh       | 35 ++++++++++++++++++++++++++++++++++-
>>  series_db_lib.sh | 25 +++++++++++++++++++++++++
>>  2 files changed, 59 insertions(+), 1 deletion(-)
>>
>> diff --git a/post_pw.sh b/post_pw.sh
>> index fe2f41c..3e3a493 100755
>> --- a/post_pw.sh
>> +++ b/post_pw.sh
>> @@ -20,6 +20,7 @@
>>  # License for the specific language governing permissions and limitations
>>  # under the License.
>>
>> +[ -f "$(dirname $0)/series_db_lib.sh" ] && source "$(dirname $0)/series_db_lib.sh" || exit 1
>>  [ -f "${HOME}/.mail_patchwork_sync.rc" ] && source "${HOME}/.mail_patchwork_sync.rc"
>>
>>  # Patchwork instance to update with new reports from mailing list
>> @@ -75,6 +76,13 @@ send_post() {
>>      if [ -z "$context" -o -z "$state" -o -z "$description" -o -z "$patch_id" ]; then
>>          echo "Skpping \"$link\" due to missing context, state, description," \
>>               "or patch_id" 1>&2
>> +        # Just don't want to even bother seeing these "bad" patches as well.
>> +        add_check_scanned_url "$patch_id" "$target_url"
>> +        return 0
>> +    fi
>> +
>> +    if check_id_exists "$patch_id" "$target_url" ; then
>> +        echo "Skipping \"$link\" - already reported." 1>&2
>>          return 0
>>      fi
>>
>> @@ -84,6 +92,7 @@ send_post() {
>>          "$api_url")"
>>      if [ $? -ne 0 ]; then
>>          echo "Failed to get proper server response on link ${api_url}" 1>&2
>> +        # Don't store these as processed in case the server has a temporary issue.
>>          return 0
>>      fi
>>
>> @@ -95,6 +104,9 @@ send_post() {
>>      jq -e "[.[].target_url] | contains([\"$mail_url\"])" >/dev/null
>>      then
>>          echo "Report ${target_url} already pushed to patchwork. Skipping." 1>&2
>> +        # Somehow this was not stored (for example, first time we apply the tracking
>> +        # feature).  Store it now.
>> +        add_check_scanned_url "$patch_id" "$target_url"
>>          return 0
>>      fi
>>
>> @@ -114,12 +126,31 @@ send_post() {
>>      if [ $? -ne 0 ]; then
>>          echo -e "Failed to push retults based on report ${link} to the"\
>>                  "patchwork instance ${pw_instance} using the following REST"\
>> -                "API Endpoint ${api_url} with the following data:\n$data\n"
>> +                "API Endpoint ${api_url} with the following data:\n$data\n" 1>&2
>>          return 0
>>      fi
>> +
>> +    add_check_scanned_url "$patch_id" "$target_url"
>>  }
>>
>> +# Collect the date.  NOTE: this needs some accomodate to catch the month change-overs
>>  year_month="$(date +"%Y-%B")"
>> +
>> +# Get the last modified time
>> +report_last_mod=$(curl -A "pw-post" -sSf "${mail_archive}${year_month}/thread.html" | grep Last-Modified)
> please use grep -i. I doubt Last-Modified will ever change. But better
> be on the safe side

Okay.

>> +
>> +mailing_list_save_file=$(echo
>> ".post_pw_${mail_archive}${year_month}" | sed -e "s@/@_@g" -e
>> "s@:@_@g" -e "s,@,_,g")
>> +
>> +if [ -e "${HOME}/${mailing_list_save_file}" ]; then
>> +    last_read_date=$(cat "${HOME}/${mailing_list_save_file}")
> wait.... what? Please correct me if I am misunderstanding this
>
> But I dont think that $last_read_date or $mailing_list_save_file ever
> get populated. They are always empty strings/file.
>
> I think you are missing a last_read_date="$report_last_mod" somewhere
> in this code. It might not be in the place that I mentioned below, but
> I am fairly certain you are missing it somewhere

Good catch - my testing was with manually editing the files.

>> +    if [ "$last_read_date" == "$report_last_mod" ]; then
>> +        echo "Last modified times match.  Skipping list parsing."
>> +        exit 0
>> +    fi
> Maybe if we change this if to an else like this
>
> else
>     "$last_read_date"="$report_last_mod"
> fi

Done.

>> +else
>> +    touch "${HOME}/${mailing_list_save_file}"
>> +fi
>> +
>>  reports="$(curl -A "pw-post" -sSf "${mail_archive}${year_month}/thread.html" | \
>>           grep -i 'HREF=' | sed -e 's@[0-9]*<LI><A HREF="@\|@' -e 's@">@\|@')"
>>  if [ $? -ne 0 ]; then
>> @@ -132,3 +163,5 @@ echo "$reports" | while IFS='|' read -r blank link title; do
>>          send_post "${mail_archive}${year_month}/$link"
>>      fi
>>  done
>> +
>> +echo "$last_read_date" > "${HOME}/${mailing_list_save_file}"
>> diff --git a/series_db_lib.sh b/series_db_lib.sh
>> index c5f42e0..0635469 100644
>> --- a/series_db_lib.sh
>> +++ b/series_db_lib.sh
>> @@ -130,6 +130,17 @@ recheck_sync INTEGER
>>  EOF
>>          run_db_command "INSERT INTO series_schema_version(id) values (8);"
>>      fi
>> +
>> +    run_db_command "select * from series_schema_version;" | egrep '^9$' > /dev/null 2>&1
>> +    if [ $? -eq 1 ]; then
>> +        sqlite3 ${HOME}/.series-db <<EOF
>> +CREATE TABLE check_id_scanned (
>> +check_patch_id INTEGER,
>> +check_url STRING
>> +)
>> +EOF
>> +        run_db_command "INSERT INTO series_schema_version(id) values (9);"
>> +    fi
>>  }
>>
>>  function series_db_exists() {
>> @@ -468,3 +479,17 @@ function set_recheck_request_state() {
>>
>>      echo "UPDATE recheck_requests set recheck_sync=$recheck_state where patchwork_instance=\"$recheck_instance\" and patchwork_project=\"$recheck_project\" and recheck_requested_by=\"$recheck_requested_by\" and recheck_series=\"$recheck_series\";" | series_db_execute
>>  }
>> +
>> +function add_check_scanned_url() {
>> +    local patch_id="$1"
>> +    local url="$2"
>> +
>> +    echo "INSERT into check_id_scanned (check_patch_id, check_url) values (${patch_id}, \"$url\");" | series_db_execute
>> +}
>> +
>> +function check_id_exists() {
>> +    local patch_id="$1"
>> +    local url="$2"
>> +
>> +    echo "select * from check_id_scanned where check_patch_id=$patch_id and check_url=\"$url\";" | series_db_execute | grep "$url" >/dev/null 2>&1
>> +}
>> --
>> 2.41.0
>>
  

Patch

diff --git a/post_pw.sh b/post_pw.sh
index fe2f41c..3e3a493 100755
--- a/post_pw.sh
+++ b/post_pw.sh
@@ -20,6 +20,7 @@ 
 # License for the specific language governing permissions and limitations
 # under the License.
 
+[ -f "$(dirname $0)/series_db_lib.sh" ] && source "$(dirname $0)/series_db_lib.sh" || exit 1
 [ -f "${HOME}/.mail_patchwork_sync.rc" ] && source "${HOME}/.mail_patchwork_sync.rc"
 
 # Patchwork instance to update with new reports from mailing list
@@ -75,6 +76,13 @@  send_post() {
     if [ -z "$context" -o -z "$state" -o -z "$description" -o -z "$patch_id" ]; then
         echo "Skpping \"$link\" due to missing context, state, description," \
              "or patch_id" 1>&2
+        # Just don't want to even bother seeing these "bad" patches as well.
+        add_check_scanned_url "$patch_id" "$target_url"
+        return 0
+    fi
+
+    if check_id_exists "$patch_id" "$target_url" ; then
+        echo "Skipping \"$link\" - already reported." 1>&2
         return 0
     fi
 
@@ -84,6 +92,7 @@  send_post() {
         "$api_url")"
     if [ $? -ne 0 ]; then
         echo "Failed to get proper server response on link ${api_url}" 1>&2
+        # Don't store these as processed in case the server has a temporary issue.
         return 0
     fi
 
@@ -95,6 +104,9 @@  send_post() {
     jq -e "[.[].target_url] | contains([\"$mail_url\"])" >/dev/null
     then
         echo "Report ${target_url} already pushed to patchwork. Skipping." 1>&2
+        # Somehow this was not stored (for example, first time we apply the tracking
+        # feature).  Store it now.
+        add_check_scanned_url "$patch_id" "$target_url"
         return 0
     fi
 
@@ -114,12 +126,31 @@  send_post() {
     if [ $? -ne 0 ]; then
         echo -e "Failed to push retults based on report ${link} to the"\
                 "patchwork instance ${pw_instance} using the following REST"\
-                "API Endpoint ${api_url} with the following data:\n$data\n"
+                "API Endpoint ${api_url} with the following data:\n$data\n" 1>&2
         return 0
     fi
+
+    add_check_scanned_url "$patch_id" "$target_url"
 }
 
+# Collect the date.  NOTE: this needs some accomodate to catch the month change-overs
 year_month="$(date +"%Y-%B")"
+
+# Get the last modified time
+report_last_mod=$(curl -A "pw-post" -sSf "${mail_archive}${year_month}/thread.html" | grep Last-Modified)
+
+mailing_list_save_file=$(echo ".post_pw_${mail_archive}${year_month}" | sed -e "s@/@_@g" -e "s@:@_@g" -e "s,@,_,g")
+
+if [ -e "${HOME}/${mailing_list_save_file}" ]; then
+    last_read_date=$(cat "${HOME}/${mailing_list_save_file}")
+    if [ "$last_read_date" == "$report_last_mod" ]; then
+        echo "Last modified times match.  Skipping list parsing."
+        exit 0
+    fi
+else
+    touch "${HOME}/${mailing_list_save_file}"
+fi
+
 reports="$(curl -A "pw-post" -sSf "${mail_archive}${year_month}/thread.html" | \
          grep -i 'HREF=' | sed -e 's@[0-9]*<LI><A HREF="@\|@' -e 's@">@\|@')"
 if [ $? -ne 0 ]; then
@@ -132,3 +163,5 @@  echo "$reports" | while IFS='|' read -r blank link title; do
         send_post "${mail_archive}${year_month}/$link"
     fi
 done
+
+echo "$last_read_date" > "${HOME}/${mailing_list_save_file}"
diff --git a/series_db_lib.sh b/series_db_lib.sh
index c5f42e0..0635469 100644
--- a/series_db_lib.sh
+++ b/series_db_lib.sh
@@ -130,6 +130,17 @@  recheck_sync INTEGER
 EOF
         run_db_command "INSERT INTO series_schema_version(id) values (8);"
     fi
+
+    run_db_command "select * from series_schema_version;" | egrep '^9$' > /dev/null 2>&1
+    if [ $? -eq 1 ]; then
+        sqlite3 ${HOME}/.series-db <<EOF
+CREATE TABLE check_id_scanned (
+check_patch_id INTEGER,
+check_url STRING
+)
+EOF
+        run_db_command "INSERT INTO series_schema_version(id) values (9);"
+    fi
 }
 
 function series_db_exists() {
@@ -468,3 +479,17 @@  function set_recheck_request_state() {
 
     echo "UPDATE recheck_requests set recheck_sync=$recheck_state where patchwork_instance=\"$recheck_instance\" and patchwork_project=\"$recheck_project\" and recheck_requested_by=\"$recheck_requested_by\" and recheck_series=\"$recheck_series\";" | series_db_execute
 }
+
+function add_check_scanned_url() {
+    local patch_id="$1"
+    local url="$2"
+
+    echo "INSERT into check_id_scanned (check_patch_id, check_url) values (${patch_id}, \"$url\");" | series_db_execute
+}
+
+function check_id_exists() {
+    local patch_id="$1"
+    local url="$2"
+
+    echo "select * from check_id_scanned where check_patch_id=$patch_id and check_url=\"$url\";" | series_db_execute | grep "$url" >/dev/null 2>&1
+}